HDP-CNN: Highway deep pyramid convolution neural network combining word-level and character-level representations for phishing website detection
Introduction
The development of the Internet has brought great social and economic progress, and the Internet has become an indispensable infrastructure. Unfortunately, technological advances have been accompanied by many complex security issues. In the hands of criminals, technology has been used to attack and defraud users. These issues include phishing, financial fraud, malware, and privacy breaches, all of which pose serious threats to Internet users (Wang et al., 2020).
As defined by the Anti-Phishing Working Group (APWG), phishing is a criminal act that uses social engineering and technology to steal users' personal identity data and financial accounts: spoofed uniform resource locator (URL) addresses and emails lure users to fake websites, where their account information and passwords are stolen (APWG, 2020). The fake URLs used for such cyber attacks and scams are called phishing URLs. Phishing URLs also use secure sockets layer (SSL) or transport layer security (TLS) certificates to lead users into thinking the websites are legitimate. The losses caused by phishing attacks every year are enormous. Therefore, many researchers and practitioners have been working to design more effective methods to detect phishing URLs.
Methods based on machine learning (Sahoo et al., 2019) are widely used for phishing website detection. Machine learning-based approaches first extract from URLs the feature representations that help discriminate phishing URLs from benign ones, and then train a predictive model on the data represented by these features. This requires researchers to have the relevant domain knowledge to extract useful features, and different feature extraction methods lead to different training results.
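To make the feature-engineering step concrete, the following is a minimal Python sketch of hand-crafted lexical features of the kind such approaches depend on. The function name `lexical_features` and the particular features are illustrative assumptions, not the feature set of any cited method.

```python
from urllib.parse import urlparse


def lexical_features(url):
    """Return a few hand-picked lexical features of a URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "has_at_symbol": "@" in url,
        # A weak signal on its own: phishing URLs may also use HTTPS.
        "uses_https": parsed.scheme == "https",
        "num_path_segments": len([p for p in parsed.path.split("/") if p]),
    }


if __name__ == "__main__":
    print(lexical_features("https://secure-login.example.com/account/verify?id=42"))
```

Each feature here encodes a guess about what separates phishing from benign URLs, which is exactly the dependence on domain knowledge described above: a different choice of features would give a different trained model.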
In contrast, deep learning methods such as convolutional neural networks (Kim, 2014) and long short-term memory networks (Bahnsen et al., 2017) can automatically discover hidden features from the original URL during training. These methods do not require manual extraction of features from URLs and depend less on data pre-processing, since neural networks are capable of extracting higher-level information from raw data. In general, deep learning methods may perform better than traditional machine learning methods. Yet precisely extracting semantic information from a URL's character sequence remains a tough task for deep learning methods.
In this paper, we propose a novel highway deep pyramid convolution neural network (HDP-CNN) that combines character-level and word-level representations to detect phishing websites. Most existing methods use only character-level information; using both enables our method to obtain richer information from URL strings and thus improves the detection performance of the model. Specifically, HDP-CNN takes URL strings as input and performs character-level and word-level embedding. The character embedding matrix and the word embedding matrix are then concatenated as the semantic representation of the URL, and a highway network is used to balance the weight of the two. Next, we feed the result into a CNN with convolution kernels of different sizes to extract local information of different lengths. We then concatenate the features extracted by the different convolution kernels and input them into DPCNN to capture the global representation of the URL. Finally, a fully connected layer produces the result of whether the URL is a phishing website. Combining word-level and character-level embeddings overcomes two weaknesses: word-level representation cannot process out-of-vocabulary words, and character-level representation does not work well on long sequences. In addition, the highway network prevents character-level information from being overwhelmed by word-level information, which could otherwise happen because the word list is much larger than the character list. Moreover, the DPCNN structure allows the network to be made deeper without adding much computational cost, which makes model training converge quickly.
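The first two stages of this pipeline can be sketched in plain Python under stated assumptions. The helper names (`char_tokens`, `word_tokens`, `to_ids`, `highway`), the rule of splitting words on non-alphanumeric characters, and the ReLU transform are all assumptions made for illustration. The gate follows the commonly used highway formulation y = T(x) * H(x) + (1 - T(x)) * x, which may differ in detail from the layer used in HDP-CNN.

```python
import math
import re


def char_tokens(url):
    """Character-level view: every character of the URL is one token."""
    return list(url)


def word_tokens(url):
    """Word-level view: split on non-alphanumeric delimiters (assumed rule)."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]


def to_ids(tokens, vocab, unk="<UNK>"):
    """Map tokens to integer ids; out-of-vocabulary tokens share the <UNK> id."""
    return [vocab.get(t, vocab[unk]) for t in tokens]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x, elementwise.

    H is a ReLU-activated affine map and T is a sigmoid gate. Both keep the
    dimensionality of x, so the carry path (1 - T) * x is well defined.
    """
    def affine(w, b):
        return [sum(wij * xj for wij, xj in zip(row, x)) + bi
                for row, bi in zip(w, b)]

    h = [max(0.0, v) for v in affine(w_h, b_h)]
    t = [sigmoid(v) for v in affine(w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


if __name__ == "__main__":
    url = "http://paypal-login.example.com/verify?id=1"
    print(word_tokens(url))
    print(to_ids(word_tokens(url), {"<UNK>": 0, "paypal": 1, "login": 2}))
```

When the gate output is close to 0 the layer passes its input through almost unchanged, and when it is close to 1 the layer outputs the transformed features. A learned gate of this kind is one way a network can weight the two concatenated representations against each other.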
The main contributions of our work are as follows:
1. We propose a novel highway deep pyramid convolution neural network (HDP-CNN), a deep convolutional network that combines character-level and word-level representation information to predict whether a given URL belongs to a phishing or a benign website.
2. We construct an imbalanced dataset to verify the performance of HDP-CNN. The dataset contains nearly 420,000 samples, and the ratio of positive to negative samples is approximately 5:1.
3. Our model was trained on a real dataset, and the experimental results show that HDP-CNN outperforms other methods, with an accuracy of 98.30%, a true positive rate (TPR) of 99.18%, and a true negative rate (TNR) of 94.34%.
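The three reported metrics can be computed from confusion-matrix counts as in the following minimal Python sketch. The function names and the toy labels are illustrative, and `positive=1` is an assumption: the text here does not state which class is treated as positive.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (accuracy, TPR, TNR) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, tpr, tnr


def accuracy_from_rates(tpr, tnr, n_pos, n_neg):
    """Accuracy is the class-size-weighted average of TPR and TNR."""
    return (n_pos * tpr + n_neg * tnr) / (n_pos + n_neg)


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 1, 0]  # toy labels with a 5:1 class ratio
    y_pred = [1, 1, 1, 1, 0, 0]
    print(rates(*confusion_counts(y_true, y_pred)))
    # Reported TPR and TNR, assuming an exact 5:1 positive-to-negative split.
    print(accuracy_from_rates(0.9918, 0.9434, 5, 1))
```

With a 5:1 class ratio, accuracy is dominated by the majority class, which is why TPR and TNR are worth reporting separately. Under an exact 5:1 split, the reported TPR and TNR would give an accuracy of about 98.37%, close to the reported 98.30%; the small gap may simply reflect that the ratio is only approximately 5:1.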
The rest of this paper is organized as follows. Section 2 reviews related work on phishing website detection. Section 3 describes the framework and details of our method. Section 4 presents our experiments and compares our method with others. Section 5 concludes the paper and discusses future work.
Related work
There has been much excellent research on the problem of phishing website detection. We describe deep learning-based phishing website detection methods, which are a hot spot in current phishing website detection research.
Deep learning methods have achieved good results in many text-related tasks, such as text classification and machine translation. In the detection of phishing websites, we can treat URL strings as text sequences, use deep learning methods to learn feature
Problem formulation
Our goal is to predict whether a given URL belongs to a phishing or a benign website. Therefore, we define the task as a classification problem. Specifically, for a given URL dataset $D = \{(u_1, y_1), \ldots, (u_N, y_N)\}$, where each entry contains a URL string and a corresponding label (phishing or benign), $u_i$ represents the $i$-th URL string in the data set, $y_i$ represents the tag corresponding to the URL, where one value of $y_i$ represents the benign website and the other represents the phishing website, and $N$ means there are $N$ data
Dataset and metrics
We crawled a large amount of real data from the PhishTank and Alexa websites to build a dataset containing nearly 420,000 samples. We divided the dataset into a training set, a validation set and a test set in the ratio 7:1.5:1.5. That is, we use 70% of the data to train the HDP-CNN model, so that the model can learn enough URL features to generalize and predict whether an unknown URL in the test data is a phishing website.
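A 7:1.5:1.5 split of this kind can be sketched as follows. The function name `stratified_split`, the fixed seed, and the choice to stratify by label (so that each subset keeps the class ratio of the full dataset) are assumptions for illustration; the text does not say whether the split was stratified.

```python
import random
from collections import defaultdict


def stratified_split(samples, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split (url, label) pairs into train/validation/test sets per label.

    Splitting each label group separately keeps the class ratio roughly the
    same in all three subsets, which matters for an imbalanced dataset.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # Toy data with a 5:1 class ratio; the URLs are placeholders.
    data = [("u%d" % i, "pos") for i in range(500)]
    data += [("v%d" % i, "neg") for i in range(100)]
    train, val, test = stratified_split(data)
    print(len(train), len(val), len(test))
```

On the toy data above, the three subsets contain 420, 90 and 90 samples, and each keeps the 5:1 ratio of the full set.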
In addition, the
Conclusion and future work
In this paper, we propose the HDP-CNN method for detecting phishing websites, a deep convolutional neural network that combines word-level and character-level representations of URLs. It consists of four modules: an embedding module, a highway network module, a region embedding module and a deep pyramid module. In addition, we constructed an imbalanced dataset whose ratio of benign websites to phishing websites is close to that in the real Internet environment to verify the performance
CRediT authorship contribution statement
Faan Zheng: Conceptualization, Methodology, Validation. Qiao Yan: Data curation, Writing – original draft. Victor C.M. Leung: Writing – review & editing. F. Richard Yu: Writing – review & editing. Zhong Ming: Supervision.
Declaration of Competing Interest
We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61976142, 61836005, 61672358) and Shenzhen Science and Technology Plan Project (JCYJ20210324093609025).
References (21)
- et al. The impact of propagation delay to different selfish miners in proof-of-work blockchains. Peer-to-Peer Networking and Applications (2020)
- et al. An effective phishing detection model based on character level convolutional neural network from url. Electronics (2020)
- APWG, 2020. Phishing activity trends report, 2nd quarter 2020....
- et al. Classifying phishing urls using recurrent neural networks. 2017 APWG Symposium on Electronic Crime Research (eCrime) (2017)
- et al. Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing url detection. Electronics (2021)
- et al. Integrating deep learning with first-order logic programmed constraints for zero-day phishing attack detection. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
- et al. Phishing url detection via capsule-based neural network. 2019 IEEE 13th International Conference on Anti-counterfeiting, Security, and Identification (ASID) (2019)
- et al. Deep pyramid convolutional neural networks for text categorization. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (2017)
- Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
- Le, H., Pham, Q., Sahoo, D., Hoi, S. C. H., 2018. Urlnet: Learning a url representation with deep learning for...
Faan Zheng is a graduate student in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. His research interests are in network security and machine learning.
Qiao Yan is a Professor in the College of Computer Science and Software Engineering at Shenzhen University, Shenzhen, China. Her research interests are in network security, software-defined networking and machine learning.
Victor C. M. Leung (Fellow, IEEE) is currently a Distinguished Professor of computer science and software engineering with Shenzhen University, Shenzhen, China. He is also an Emeritus Professor of electrical and computer engineering and the Director of the Laboratory for Wireless Networks and Mobile Systems, The University of British Columbia (UBC), Vancouver, Canada. He is a fellow of the Royal Society of Canada, Canadian Academy of Engineering, and Engineering Institute of Canada. His research is in the broad areas of wireless networks and mobile systems.