HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection

doi:10.1016/j.cose.2021.102584

Computers & Security

Volume 114, March 2022, 102584

Abstract

Phishing has become a prevailing method for attackers to steal users’ private data and commit fraud, posing a serious threat to Internet users. How to detect phishing websites has attracted great interest from both academia and industry. A popular approach is to use a support vector machine (SVM) to detect phishing websites. However, this approach relies on extracting features designated by experts, and the prediction effectiveness of the model is greatly affected by the quality of feature extraction. In addition, it cannot handle features that are not identifiable. Deep learning methods have therefore become popular, as they do not require manual feature engineering. However, many deep learning methods can only learn feature information of uniform resource locators (URLs) at the character level, while ignoring the intrinsic connections between words. To address these limitations, we propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines character-level and word-level representation information. HDP-CNN first receives the URL string sequences as input, then performs character-level embedding and word-level embedding respectively. Afterward, it uses a highway network to connect the character-level and word-level embedding representations of the URL and extracts local features of different sizes from the region embedding layer. Finally, it passes them into the designed deep pyramid network to capture the global representation of the URL. Our experiments illustrate that the information expressed by embedding vectors of different granularities differs subtly. By combining embedding feature information of different granularities, HDP-CNN exhibits better performance than methods based on a single kind of embedding feature information. In our experiments, we construct an imbalanced dataset in which the ratio of benign websites to phishing websites is close to 5:1. The experimental results demonstrate that our method outperforms other methods, with accuracy at 98.30%, true positive rate (TPR) at 99.18%, and true negative rate (TNR) at 94.34%.

Introduction

The development of the Internet has brought great social and economic progress, and the Internet has become an indispensable infrastructure. Unfortunately, technological advances have been accompanied by many complex security issues. In the hands of criminals, technology has been used to attack and defraud users. Such attacks include phishing, financial fraud, malware, and privacy breaches, which pose serious threats to Internet users (Wang et al., 2020).

As defined by the Anti-Phishing Working Group (APWG), phishing is a criminal act that uses social engineering and technology to steal users’ personal identity data and financial accounts, using spoofed uniform resource locator (URL) addresses and emails to lure users to fake websites and steal their account information and passwords (APWG, 2020). The fake URLs used for such cyber attacks and scams are called phishing URLs. Phishing URLs also use secure sockets layer (SSL) or transport layer security (TLS) certificates to make users believe the websites are legitimate. The losses caused by phishing attacks every year are enormous. Therefore, many researchers and practitioners have been working to design more effective methods to detect phishing URLs.

Methods based on machine learning (Sahoo et al., 2019) are widely used for phishing website detection. Machine learning-based approaches require extracting from URLs the feature representations that help discriminate phishing from benign websites, and then training a predictive model on the data represented by these features. This requires researchers to have the relevant domain knowledge to extract suitable features, and different feature extraction methods lead to different training results.

In contrast, deep learning methods such as the convolutional neural network (Kim, 2014) and the long short-term memory network (Bahnsen et al., 2017) can automatically discover hidden features from the original URL for training. These methods do not require manual extraction of features from URLs and depend less on data pre-processing, since neural networks are capable of extracting higher-level information from raw data. In general, deep learning methods may perform better than traditional machine learning methods. Yet precisely extracting semantic information from the URL’s character sequence is also a tough task for deep learning methods.

In this paper, we propose a novel highway deep pyramid convolution neural network (HDP-CNN) that combines character-level and word-level representations to detect phishing websites. Whereas most methods use only character-level information, our method obtains richer information from URL strings and thus improves detection performance. Specifically, HDP-CNN takes URL strings as input and performs character-level and word-level embedding. The character embedding matrix and the word embedding matrix are then concatenated as the semantic representation of the URL, and a highway network is used to balance the weight of the two. Next, we feed the result into a CNN with convolution kernels of different sizes to extract local information of different lengths. We then concatenate the features extracted by the different convolution kernels and input them into a deep pyramid convolutional neural network (DPCNN) to capture the global representation of the URL. Finally, a fully connected layer outputs whether the URL is a phishing website. Combining word-level and character-level embeddings overcomes both the inability of word-level representations to handle out-of-vocabulary words and the weakness of character-level representations on long sequences. In addition, the highway network prevents character-level information from being overwhelmed by word-level information, since the word vocabulary is much larger than the character vocabulary. Moreover, the DPCNN structure allows the network to be made deeper without adding much computational cost, which helps the model training converge quickly.
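The first steps of this pipeline can be illustrated with a small sketch. It is not the authors' implementation: the tokenization rule (splitting on non-alphanumeric characters), the toy vector sizes, the random weights, and the example URL are all assumptions made for illustration. The gating formula y = T(x) * H(x) + (1 - T(x)) * x is the standard highway-layer form, applied here to a single concatenated character/word vector.

```python
import math
import random
import re


def char_tokens(url):
    """Character-level view: every character of the URL is a token."""
    return list(url)


def word_tokens(url):
    """Word-level view. Assumption: words are runs of letters and digits."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, weights, bias):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Standard highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    h = [max(0.0, v) for v in affine(x, w_h, b_h)]   # candidate H(x), ReLU
    t = [sigmoid(v) for v in affine(x, w_t, b_t)]    # transform gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


url = "http://login.example.com/verify"  # hypothetical URL
print(char_tokens(url)[:7])
print(word_tokens(url))

# Toy stand-ins for one character embedding and one word embedding.
rng = random.Random(0)
char_vec = [rng.gauss(0, 1) for _ in range(3)]
word_vec = [rng.gauss(0, 1) for _ in range(3)]
x = char_vec + word_vec                  # concatenated representation
d = len(x)

w_h = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
w_t = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
b_h = [0.0] * d
b_t = [-2.0] * d                         # negative bias: gate starts mostly closed

y = highway_layer(x, w_h, b_h, w_t, b_t)
print(len(y) == len(x))
```

Because the gate is computed per dimension, the layer can pass some components of the concatenated vector through unchanged while transforming others, which is one way to read the "balance the weight of the two" role described above.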

The main contributions of our work are as follows:

1. We propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines both character-level and word-level representation information to predict whether a given URL belongs to a phishing or a benign website.
2. We construct an imbalanced dataset to verify the performance of HDP-CNN. The dataset contains nearly 420,000 samples, and the ratio of positive to negative samples is approximately 5:1.
3. Our model was trained on a real dataset, and the experimental results show that HDP-CNN outperforms other methods, with accuracy at 98.30%, true positive rate (TPR) at 99.18%, and true negative rate (TNR) at 94.34%.
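For readers less familiar with the three reported metrics, the following sketch shows how accuracy, TPR and TNR are derived from a confusion matrix. The labels and predictions are toy values, not data from the paper, and which class counts as positive is left as a parameter because this excerpt does not state it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (accuracy, TPR, TNR)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)   # share of actual positives that were detected
    tnr = tn / (tn + fp)   # share of actual negatives that were rejected
    return accuracy, tpr, tnr


# Toy labels and predictions (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy, tpr, tnr = rates(*counts)
print(counts, accuracy, tpr, tnr)
```

On an imbalanced dataset, accuracy is dominated by the majority class, which is why TPR and TNR are reported alongside it.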

The rest of this paper is organized as follows. Section 2 reviews the related work on phishing website detection. Section 3 describes the framework and details of our method. Section 4 presents our experiments and compares our method with others. Section 5 concludes the paper and discusses future work.

Section snippets

Related work

There has been much excellent research on the problem of phishing website detection. Here we focus on deep learning-based methods, which are a hot spot in current phishing website detection research.

Deep learning methods have achieved good results in many text-related tasks, such as text classification and machine translation. In the detection of phishing websites, we can treat URL strings as text sequences, use deep learning methods to learn feature

Problem formulation

Our goal is to predict whether a given URL belongs to a phishing or a benign website. Therefore, we define the task as a classification problem. Specifically, we are given a URL dataset $U = \{(x_{i}, y_{i}) \mid i = 1, \dots, N\}$, where each entry contains a URL string and a corresponding label (phishing or benign): $x_{i}$ represents a URL string in the dataset, $y_{i} \in \{0, 1\}$ represents the label corresponding to that URL, where $y_{i} = 0$ denotes a benign website and $y_{i} = 1$ denotes a phishing website, and $N$ means there are $N$ data

Dataset and metrics

We crawled a large amount of real data from the PhishTank and Alexa websites to build a dataset containing nearly 420,000 samples. We divided the dataset into a training set, a validation set and a test set in the ratio 7:1.5:1.5. That is, we use 70% of the data to train the HDP-CNN model, so that the model can learn enough URL features to generalize and predict whether an unseen URL in the test data is a phishing website.
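A 7:1.5:1.5 split of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the fixed seed, the use of a plain random shuffle (rather than, say, a stratified split), and the placeholder URLs are all assumptions.

```python
import random


def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle and split samples into train/validation/test partitions.

    The test partition receives whatever remains after the first two,
    so the three parts always cover the dataset exactly once.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumption: simple random shuffle
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder (url, label) pairs; the URLs and labels are made up.
samples = [(f"http://site{i}.example/", 1 if i % 6 == 0 else 0)
           for i in range(1000)]

train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

With an imbalanced dataset, a stratified split that preserves the class ratio in each partition may be preferable; the excerpt does not say which was used.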

In addition, the

Conclusion and future work

In this paper, we propose the HDP-CNN method for detecting phishing websites, a deep convolutional neural network that combines word-level and character-level representations of URLs. It consists of four modules: an embedding module, a highway network module, a region embedding module and a deep pyramid module. In addition, we constructed an imbalanced dataset whose ratio of benign websites to phishing websites is close to that in the real Internet environment to verify the performance

CRediT authorship contribution statement

Faan Zheng: Conceptualization, Methodology, Validation. Qiao Yan: Data curation, Writing – original draft. Victor C.M. Leung: Writing – review & editing. F. Richard Yu: Writing – review & editing. Zhong Ming: Supervision.

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61976142, 61836005, 61672358) and Shenzhen Science and Technology Plan Project (JCYJ20210324093609025).

Faan Zheng is a graduate student in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. His research interests are in network security and machine learning.

References (21)

H. Wang et al., The impact of propagation delay to different selfish miners in proof-of-work blockchains, Peer-to-Peer Networking and Applications (2020)
A. Aljofey et al., An effective phishing detection model based on character level convolutional neural network from url, Electronics (2020)
APWG, 2020. Phishing activity trends report, 2nd quarter 2020....
A.C. Bahnsen et al., Classifying phishing urls using recurrent neural networks, 2017 APWG Symposium on Electronic Crime Research (eCrime) (2017)
S.-J. Bu et al., Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing url detection, Electronics (2021)
S.-J. Bu et al., Integrating deep learning with first-order logic programmed constraints for zero-day phishing attack detection, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Y. Huang et al., Phishing url detection via capsule-based neural network, 2019 IEEE 13th International Conference on Anti-counterfeiting, Security, and Identification (ASID) (2019)
R. Johnson et al., Deep pyramid convolutional neural networks for text categorization, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017)
Y. Kim, Convolutional neural networks for sentence classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
Le, H., Pham, Q., Sahoo, D., Hoi, S. C. H., 2018. Urlnet: Learning a url representation with deep learning for...

There are more references available in the full text version of this article.


Qiao Yan is a Professor in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. Her research interests are in network security, software-defined networking and machine learning.

Victor C. M. Leung (Fellow, IEEE) is currently a Distinguished Professor of computer science and software engineering with Shenzhen University, Shenzhen, China. He is also an Emeritus Professor of electrical and computer engineering and the Director of the Laboratory for Wireless Networks and Mobile Systems, The University of British Columbia (UBC), Vancouver, Canada. He is a fellow of the Royal Society of Canada, Canadian Academy of Engineering, and Engineering Institute of Canada. His research is in the broad areas of wireless networks and mobile systems.
