Full length article
DNS anti-attack machine learning model for DGA domain name detection
Introduction
The Domain Name System (DNS) [1], [2] is a vital Internet service that maps domain names to IP addresses, so users can reach Internet resources by domain name rather than by IP address. DNS relies on distributed databases that store these mappings. However, some of the stored entries may be malicious domain names supplied by attackers. Malicious domain names hidden in DNS pose serious security risks to the Internet. For example, an attacker who controls a botnet may use malicious domain names to launch a large-scale attack against DNS servers, interrupting or paralyzing DNS service for some time.
Traditional methods for detecting malicious domain names are generally based on rule matching: newly discovered malicious domain names are continually added to blacklists. Blacklisting alone is no longer sufficient to prevent malware from successfully commanding and controlling botnets. Attackers now apply Domain Generation Algorithms (DGA) [3], [4] to generate large numbers of malicious domain names that are hidden in the distributed database of DNS. An attack may occur only once a malicious domain name is activated. A DGA can generate thousands of malicious domain names every day. Because the generated domain names are disposable, both the malware and the botmaster can generate more of them dynamically and repeat the process whenever needed. Most DGA domain names remain inactive for a long time, and only a small number are chosen for follow-up attacks, which makes them harder to detect than typical malicious domain names. Traditional detection methods therefore struggle to cope with the massive volume of domain names generated by DGA.
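To make the blacklisting problem concrete, the following is a toy sketch of a date-seeded generator. It is not taken from the paper or from any real malware family: the function name `toy_dga`, the seed string, the SHA-256 derivation and the reserved `.example` suffix are all assumptions chosen for illustration. It only shows why two parties sharing a seed can derive the same fresh list of names each day without communicating.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12,
            tld: str = ".example") -> List[str]:
    """Derive `count` pseudo-random names from a shared seed and a date.

    Illustrative only: the hashing scheme and the `.example` suffix are
    assumptions, not the algorithm of any real DGA.
    """
    domains = []
    for index in range(count):
        material = f"{seed}|{day.isoformat()}|{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit (0-15) onto a letter in 'a'..'p'.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    # The same seed and date always give the same list; a new date gives a new list.
    for name in toy_dga("shared-secret", date(2020, 1, 1)):
        print(name)
```

Because the list changes with the date, a blacklist built from yesterday's names says nothing about today's, which is the motivation for classifying names by their features instead.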
Recently, Artificial Intelligence (AI) has made breakthroughs across a variety of machine learning and deep learning algorithms [5]. These algorithms have been widely used in many fields, and Machine Learning can solve many specific classification and recognition problems. Some researchers have begun to apply machine learning methods to detect DGA domain names automatically. For example, a deep learning method for DGA domain name detection is described in [3]; its main contribution is a heuristic tagging method for obtaining a large number of labeled samples for deep learning. In this paper, we study feature extraction algorithms for DGA domain name detection and match them with machine learning algorithms to find a suitable classification model for the automatic detection of DGA domain names.
Machine Learning offers many common algorithms. The choice of algorithm depends largely on the training set and its characteristics, and is closely related to the classification and recognition target, especially the specific usage scenario. According to the detection requirements and the characteristics of DGA, the experiments in this paper are based on six classification algorithms: Naive Bayesian (NB) [6], [7], [8], Extreme Gradient Boosting (XGBoost) [9], [10], Multi-Layer Perceptron (MLP) [11], [12], Long Short-Term Memory (LSTM) [13], [14], [15], Random Forest (RF) [16], [17], [18] and Support Vector Machine (SVM) [19], [20], [21]. We propose five types of features for DGA domain name detection, namely the character feature, the Unicode feature and the bag-of-words (word-bag) model [22], [23] with 2-grams, 3-grams and 4-grams.
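The five feature types can be sketched as follows. The exact definitions are not given in this part of the paper, so everything below is an assumption: `character_features` uses hypothetical hand-crafted statistics (length, digit ratio, vowel ratio), `unicode_features` encodes each character as its code point padded to a fixed length, and `ngram_bag` counts overlapping character n-grams for n = 2, 3 or 4.

```python
from collections import Counter

VOWELS = set("aeiou")


def character_features(name):
    """Hypothetical character statistics; the paper's exact feature set may differ."""
    name = name.lower()
    length = len(name)
    if length == 0:
        return {"length": 0, "digit_ratio": 0.0, "vowel_ratio": 0.0}
    return {
        "length": length,
        "digit_ratio": sum(c.isdigit() for c in name) / length,
        "vowel_ratio": sum(c in VOWELS for c in name) / length,
    }


def unicode_features(name, max_len=16):
    """Encode characters as code points, truncated or zero-padded to max_len."""
    codes = [ord(c) for c in name.lower()[:max_len]]
    return codes + [0] * (max_len - len(codes))


def ngram_bag(name, n):
    """Bag of overlapping character n-grams (order is discarded, counts are kept)."""
    name = name.lower()
    return Counter(name[i:i + n] for i in range(len(name) - n + 1))


if __name__ == "__main__":
    print(character_features("abc1"))
    print(unicode_features("ab", max_len=4))
    for n in (2, 3, 4):
        print(n, dict(ngram_bag("google", n)))
```

In practice the n-gram bags would be mapped onto a fixed vocabulary to produce equal-length vectors before being passed to a classifier; that step is omitted here.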
The contributions of this paper are as follows. We match six algorithms with five features to obtain the optimal DGA detection model. To find it, we trained thirty candidate detection models based on the 6×5 matching, and each candidate model was obtained through a large number of comparative experiments with parameter adjustment. The evaluation indexes for the models are Accuracy, Precision, Recall, F1 score and Training Time. A detailed comparative analysis of the candidate DGA detection models is given, and the optimized DGA detection model is presented.
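The four quality indexes named above have standard definitions in terms of true and false positives and negatives. The sketch below computes them for a binary task; the function name `classification_metrics`, the choice of label `1` for the DGA class and the toy label vectors are assumptions made for illustration and are not results from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only: 1 = DGA, 0 = benign.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    guess = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(truth, guess))
```

Training Time, the fifth index, is a wall-clock measurement around the fitting step and is not derived from the confusion counts.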
The paper is organized as follows. The next section describes the collection of DGA samples and the extraction of DGA features. Section 3 presents the method of training the DGA detection model. Section 4 describes the experimental chart, the model evaluation indicators, and the experimental environment and procedures. Results and discussion are presented in Section 5. Finally, Section 6 concludes the paper and outlines future work.
Section snippets
DGA features analysis
The collection and quality of the samples are critical and directly determine the effectiveness of Machine Learning. To obtain a machine learning-based DGA domain name detection classifier, a large number of samples need to be collected and divided into training sets and test sets according to a specific algorithm. The model is then trained and tested on this division.
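The division into training and test sets can be sketched as a seeded random split. The paper's specific division algorithm is not shown in this excerpt, so the shuffle-and-cut approach, the function name `split_samples` and the 80/20 ratio below are assumptions.

```python
import random


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into (train, test) lists."""
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_ratio)))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder (name, label) pairs; real samples would come from collected data.
    data = [(f"name{i}", i % 2) for i in range(10)]
    train, test = split_samples(data)
    print(len(train), len(test))
```

For imbalanced data, a stratified split that preserves the benign-to-DGA ratio in both sets would usually be preferable to a plain shuffle.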
Training refers to feeding the optimized training set samples to classification algorithms for
Method of DGA detection model
To learn and process the different characteristics of the DGA from multiple perspectives, we used six Machine Learning methods: NB, XGBoost, MLP, LSTM, RF, and SVM.
1. NB
NB is a probability-based machine learning method that uses probability to represent the uncertainty of information. It learns and reasons about unknown information using the rules of probability. In this paper, we use the Bayesian formula to calculate the posterior distribution of this parameter according to the prior
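As an illustration of the Bayesian reasoning described for NB, the following is a minimal multinomial Naive Bayes sketch over character bigrams with Laplace smoothing. It is not the authors' implementation: the class name `TinyNaiveBayes`, the smoothing constant and the toy training names are assumptions. The score for each class is the log prior plus the summed log likelihoods of the observed features, which is proportional to the log posterior.

```python
import math
from collections import Counter, defaultdict


def bigrams(name):
    """Overlapping character 2-grams of a lowercase name."""
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]


class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, feature_lists, labels):
        for features, label in zip(feature_lists, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def log_scores(self, features):
        """Unnormalized log posterior: log P(class) + sum of log P(feature | class)."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for feature in features:
                seen = self.feature_counts[label][feature]
                score += math.log((seen + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy names for illustration only; they are not samples from the paper.
    names = ["google", "amazon", "xqzkvw", "zqxjkv"]
    labels = ["benign", "benign", "dga", "dga"]
    model = TinyNaiveBayes().fit([bigrams(n) for n in names], labels)
    print(model.predict(bigrams("goog")), model.predict(bigrams("xqzk")))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.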
Experiment design
The experiment is described below, including the model training process, the parameters of the experimental environment, the model evaluation indicators and the selection of the optimal model.
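The selection of the optimal model over all algorithm and feature pairings can be sketched as a simple search over the 6×5 grid. The helper `select_best` and the `evaluate` callback are hypothetical: `evaluate` stands in for whatever training and testing routine produces a metrics dictionary for one pairing, and ranking by a single key such as F1 is a simplification of the paper's multi-index comparison.

```python
from itertools import product

ALGORITHMS = ["NB", "XGBoost", "MLP", "LSTM", "RF", "SVM"]
FEATURES = ["character", "unicode", "2-gram", "3-gram", "4-gram"]


def select_best(evaluate, algorithms=ALGORITHMS, features=FEATURES, key="f1"):
    """Evaluate every (algorithm, feature) pair and return the best plus all results.

    `evaluate(algorithm, feature)` is a caller-supplied function that trains and
    tests one candidate model and returns a dict containing at least `key`.
    """
    results = []
    for algorithm, feature in product(algorithms, features):
        metrics = evaluate(algorithm, feature)
        results.append((algorithm, feature, metrics))
    best = max(results, key=lambda item: item[2][key])
    return best, results


if __name__ == "__main__":
    # Placeholder scorer so the sketch runs; it returns no real measurements.
    def placeholder(algorithm, feature):
        return {"f1": 0.0}

    best, results = select_best(placeholder)
    print(len(results), "candidates evaluated")
```

With six algorithms and five features the loop visits thirty candidates, matching the number of candidate models trained in the paper.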
Results and discussion
According to Fig. 3, we applied the five feature extraction methods to obtain five different feature sets from the original samples. The six machine learning algorithms were then executed on each feature set to identify DGA domain names. The experimental results are shown in Table 3, Table 4, Table 5, Table 6 and Table 7. In fact, each row of data in the five tables is an optimal result of
Conclusions
Domain names generated by DGA may be malicious and thus threaten network security. This paper studies a detection model for DGA domain names based on Machine Learning. We analyzed the characteristics of DGA domain names. Five feature extraction methods are used to obtain feature sets from different aspects, including the character feature, the Unicode feature and the word-bag model with 2-grams, 3-grams and 4-grams. Six kinds of Machine Learning algorithms are applied to achieve the
CRediT authorship contribution statement
Jian Mao: Writing - original draft. Jiemin Zhang: Supervision. Zhi Tang: Investigation. Zhiling Gu: Methodology.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funding
This research was supported by the Key Programs of Science and Technology of Fujian Province, China (No. 2018H0025) and by the Xiamen Science and Technology Project, China, under grants 3502Z20173033 and 3502Z20183037.
References (24)
- et al. Detecting word-based algorithmically generated domains using semantic analysis. Symmetry (2019)
- et al. A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing (2018)
- et al. Detection of DNS DDoS Attacks with Random Forest Algorithm on Spark (2018)
- et al. Comparison of Support Vector Machine and Extreme Gradient Boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A case study in China. Energy Convers. Manage. (2018)
- et al. Research on DNS resolution and defense technology. Comput. Netw. (2017)
- Research on Abnormal Domain Name Detection Based on DNS Log Data (2018)
- et al. Weakly supervised deep learning for the detection of domain generation algorithms. IEEE Access (2019)
- et al. Deep learning. Nature (2015)
- et al. A targeted Bayesian network learning for classification. Qual. Technol. Quant. M (2019)
- Pedagogy of Bayes rule, confusion matrix, transition matrix, and receiver operating characteristics. Comput. Appl. Eng. Educ. (2019)
- Hybrid optimization with cryptography encryption for medical image security in Internet of Things. Neural Comput. Appl.
- XGBoost Classifier for DDoS Attack Detection and Analysis in SDN-Based Cloud
Cited by (8)
Poster: P4DME: DNS Threat Mitigation with P4 In-Network Machine Learning Offload
2023, EuroP4 2023 - Proceedings of the 6th International Workshop on P4 in Europe
Towards DGA Domain Name Detection via Multi-feature Coordinated Representation and Random Forest
2023, Proceedings - 2023 11th International Conference on Information Systems and Computing Technology, ISCTech 2023
DNS Tunnel Detection Scheme Based on Machine Learning in Campus Network
2022, Proceedings - 2022 4th International Conference on Machine Learning, Big Data and Business Intelligence, MLBDBI 2022
Malicious DNS Detection and Prediction Using SMOTE-ENN and Hybrid Artificial Neural Network
2022, 3rd IEEE 2022 International Conference on Computing, Communication, and Intelligent Systems, ICCCIS 2022
Malicious domain name detection based on Doc2vec and hybrid network
2021, IOP Conference Series: Earth and Environmental Science
Jian Mao was born in Shaxian, Fujian, P.R. China, in 1980. He received his B.S. degree in Automatic Control in 2002 and his M.S. degree in Systems Engineering in 2007 from Xiamen University, Xiamen, China. He is currently pursuing the Ph.D. degree at the Department of Electronic Science and Technology, National University of Defense Technology, Changsha, China. Since 2012, he has been engaged in information security research at the key confidential information laboratory of Xiamen, China. He is currently an associate professor in the Computer Engineering College of Jimei University, China. His research interests include deep learning, information security and electromagnetic information leakage.
Jiemin Zhang was born in Taiyuan, Shanxi, P.R. China, in 1964. She received a Master's degree in computer science from Fudan University, Shanghai, China, in 1992. She is currently a professor in the Computer Engineering College, Jimei University, Xiamen, China. Her research interests include intelligence science and information security.
Zhi Tang was born in Hezhou, Guangxi, P.R. China, in 1997. He is a junior majoring in computer science and technology in the Computer Engineering College of Jimei University, Xiamen, China. He is currently studying and assisting teachers in the key project group of the Fujian Provincial Science and Technology Department.
Zhiling Gu was born in Kunming, Yunnan, P.R. China, in 1998. She is a graduate in network engineering of the Computer Engineering College, Jimei University, Xiamen, China. She is currently studying and assisting teachers in the key project group of the Fujian Provincial Science and Technology Department.