Full length article
DNS anti-attack machine learning model for DGA domain name detection

https://doi.org/10.1016/j.phycom.2020.101069

Abstract

The Domain Name System (DNS) is a vital Internet service. Domain names generated by a Domain Generation Algorithm (DGA) and hidden in the DNS distributed database pose potential risks because they may be malicious. Traditional methods struggle to detect DGA domain names because such names are random, dynamic and numerous. In this paper, a machine learning-based DGA domain name detection method is proposed. We analyze the characteristics of DGA domain names using five feature extraction methods. We then apply six Machine Learning algorithms to the five types of feature sets to obtain thirty candidate DGA detection models. The optimized DGA detection model is selected through comparative experiments, using Accuracy, Precision, Recall, F1 score and Training Time as evaluation indexes. The experimental results show that the Machine Learning-based approach can effectively identify DGA domain names.

Introduction

The Domain Name System (DNS) [1], [2] is a vital Internet service that maps domain names to IP addresses, enabling users to access Internet resources by domain name rather than by IP address. DNS relies on distributed databases that store these mappings. However, some entries in these databases may be malicious domain names supplied by attackers. Malicious domain names hidden in DNS pose great security risks to the Internet. For example, an attacker may use malicious domain names while manipulating Botnets to launch a large-scale attack against DNS servers, causing the interruption or paralysis of the DNS service for some time.

In general, traditional detection methods for malicious domain names are based on rule matching, and newly discovered malicious domain names are continually added to blacklists. Nowadays, blacklisting is no longer sufficient to prevent malware from successfully commanding and controlling Botnets. Attackers apply Domain Generation Algorithms (DGA) [3], [4] to generate large numbers of malicious domain names that are hidden in the distributed database of DNS. Attacks are possible only once malicious domain names are activated. A DGA can generate thousands of malicious domain names every day. Both the malware and the botmaster can dynamically generate more domain names, repeating the process as needed, because the generated domain names are disposable. Most DGA domain names remain inactive for a long time. Only a small number of them are chosen as follow-up attack domain names, which makes them harder to detect than typical malicious domain names. Traditional detection methods therefore struggle to cope with the massive number of domain names generated by DGA.

Recently, Artificial Intelligence (AI) has made breakthroughs across a variety of machine learning and deep learning algorithms [5]. These algorithms have been widely used in many fields. Machine Learning can solve many specific classification and recognition problems. Some researchers have begun to apply machine learning methods to detect DGA domain names automatically. For example, a deep learning method for DGA domain name detection is described in [3]; its main contribution is a heuristic tagging method for obtaining a large number of labeled samples for deep learning. In this paper, we study feature extraction algorithms for DGA domain name detection and pair them with machine learning algorithms to find a suitable classification model for the automatic detection of DGA domain names.

Machine Learning offers many common algorithms. The choice of algorithm depends largely on the training set and its characteristics, and is closely related to the classification and recognition target, especially the specific usage scenario. According to the detection requirements and DGA characteristics, the experiments in this paper are based on six classification algorithms: Naive Bayesian (NB) [6], [7], [8], Extreme Gradient Boosting (XGBoost) [9], [10], Multi-Layer Perceptron (MLP) [11], [12], Long Short-Term Memory (LSTM) [13], [14], [15], Random Forest (RF) [16], [17], [18] and Support Vector Machine (SVM) [19], [20], [21]. We propose five types of features for DGA domain name detection, namely the character feature, the Unicode feature and the word-bag model [22], [23] in 2-gram, 3-gram and 4-gram.
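As a rough illustration of the n-gram and Unicode features named above, the following sketch shows one plausible way to compute them with the Python standard library. The exact definitions used in the paper are not given in this excerpt, so the function names (`char_ngrams`, `ngram_bag`, `unicode_feature`), the fixed length of 16 and the zero-padding scheme are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(domain: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a domain string."""
    return [domain[i:i + n] for i in range(len(domain) - n + 1)]


def ngram_bag(domain: str, n: int) -> Counter:
    """Bag-of-n-grams: count how often each n-gram occurs (assumed form)."""
    return Counter(char_ngrams(domain, n))


def unicode_feature(domain: str, length: int = 16) -> list[int]:
    """Encode each character as its code point, truncated or zero-padded
    to a fixed length (the length and padding are illustrative choices)."""
    codes = [ord(c) for c in domain[:length]]
    return codes + [0] * (length - len(codes))


print(char_ngrams("google", 2))   # ['go', 'oo', 'og', 'gl', 'le']
print(unicode_feature("ab", 4))   # [97, 98, 0, 0]
```

In practice the per-domain counts would be mapped onto a shared vocabulary so that every sample yields a vector of the same dimension before being passed to a classifier.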

The contributions of this paper are as follows. We match six algorithms with five features to obtain the optimal DGA detection model. To find the optimal model, we trained thirty candidate detection models based on the 6 × 5 matching. Each candidate model was obtained through a large number of comparative experiments with parameter adjustment. The evaluation indexes for the models are Accuracy, Precision, Recall, F1 score and Training Time. A detailed comparative analysis of the candidate DGA detection models is given and the optimized DGA detection model is presented.
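The 6 × 5 matching can be pictured as a Cartesian product of algorithms and feature sets. The sketch below only enumerates the thirty pairings and picks the pair with the highest score from a caller-supplied mapping. The labels in `FEATURES`, the helper names and the idea of ranking by a single score are assumptions; the paper compares several indexes, and this excerpt does not state how they are combined.

```python
from itertools import product

# Algorithm and feature-set labels as named in the text.
ALGORITHMS = ["NB", "XGBoost", "MLP", "LSTM", "RF", "SVM"]
FEATURES = ["character", "unicode", "2-gram", "3-gram", "4-gram"]


def candidate_models() -> list[tuple[str, str]]:
    """Enumerate every (algorithm, feature set) pairing."""
    return list(product(ALGORITHMS, FEATURES))


def select_best(scores: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Return the pairing with the highest score.

    `scores` is a hypothetical mapping produced by the caller's own
    training and evaluation runs, for example F1 per candidate.
    """
    return max(scores, key=scores.get)


print(len(candidate_models()))  # 30
```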

The paper is organized as follows. In the next section, we describe the collection of DGA samples and the extraction of the DGA features. Section 3 presents the method for training the DGA detection model. Section 4 describes the experimental chart, the model evaluation indicators and the experimental procedures and environment. Results and discussion are given in Section 5. Finally, Section 6 concludes the paper and outlines future work.

Section snippets

DGA features analysis

The collection and quality of the samples are critical and directly determine the effectiveness of Machine Learning. To obtain a machine learning-based DGA domain name detection classifier, a large number of samples need to be collected and divided into training sets and test sets according to a specific algorithm. The model is then trained and tested on the basis of this division.
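The division algorithm is not specified in this excerpt. As a minimal sketch, assuming a simple seeded random split with a hypothetical 80/20 ratio, labeled samples could be divided as follows; the function name and defaults are illustrative only.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle the samples reproducibly and split them into
    (training set, test set). The ratio is an assumed default."""
    rng = random.Random(seed)
    shuffled = list(samples)          # copy so the input is not mutated
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder (domain, label) pairs: 1 = DGA, 0 = benign.
samples = [(f"domain{i}.example", i % 2) for i in range(10)]
train, test = train_test_split(samples)
print(len(train), len(test))  # 8 2
```

A stratified split, which preserves the ratio of DGA to benign samples in both sets, may be preferable when the classes are imbalanced.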

Training refers to feeding the optimized training set samples to classification algorithms for

Method of DGA detection model

To learn and process the different characteristics of the DGA from multiple perspectives, we used six Machine Learning methods: NB, XGBoost, MLP, LSTM, RF, and SVM.

1. NB

NB is a probability-based machine learning method that uses probability to represent the uncertainty of information. It learns and reasons about unknown information using the rules of probability. In this paper, we use the Bayesian formula to calculate the posterior distribution of this parameter according to the prior

Experiment design

The experiment is described below, including the specific model training process, the experimental environment parameters, the model evaluation indicators and the selection of the optimal model.
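The evaluation indicators named in this paper (Accuracy, Precision, Recall, F1 score and Training Time) follow standard definitions for binary classification. The sketch below computes them from true and predicted labels, assuming label 1 denotes a DGA domain and 0 a benign one. The helper names `classification_metrics` and `timed_fit` are hypothetical, and undefined ratios are reported as 0.0 by convention here.

```python
import time


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def timed_fit(fit_fn):
    """Call a zero-argument training function and return
    (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    model = fit_fn()
    return model, time.perf_counter() - start
```

For example, `classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])` yields 0.5 for all four metrics, since there is exactly one true positive, one true negative, one false positive and one false negative.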

Results and discussion

Following Fig. 3, we applied each of the five feature extraction methods to the original samples to obtain five different feature sets. The six machine learning algorithms were then executed on each feature set to identify DGA domain names. The experimental results are shown in Table 3, Table 4, Table 5, Table 6 and Table 7. In fact, each row of data in the five tables is an optimal result of

Conclusions

Domain names generated by DGA may be malicious and thus threaten network security. This paper studies a Machine Learning-based detection model for DGA domain names. We analyzed the characteristics of DGA domain names. Five feature extraction methods are used to obtain characteristic sets from different aspects, including the character feature, the Unicode feature and the word-bag model in 2-gram, 3-gram and 4-gram. Six kinds of Machine Learning algorithms are applied to achieve the

CRediT authorship contribution statement

Jian Mao: Writing - original draft. Jiemin Zhang: Supervision. Zhi Tang: Investigation. Zhiling Gu: Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This research was supported by the Key Programs of Science and Technology of Fujian Province, China (No. 2018H0025) and by the Xiamen Science and Technology Project, China, under grants 3502Z20173033 and 3502Z20183037.


References (24)

  • Elhoseny, Mohamed et al. Hybrid optimization with cryptography encryption for medical image security in Internet of Things. Neural Comput. Appl. (2018)
  • Chen, Z. et al. XGBoost Classifier for DDoS Attack Detection and Analysis in SDN-Based Cloud (2018)

Jian Mao was born in Shaxian, Fujian, P.R. China, in 1980. He received his B.S. degree in Automatic Control in 2002 and his M.S. degree in Systems Engineering in 2007 from Xiamen University, Xiamen, China. He is currently pursuing the Ph.D. degree at the Department of Electronic Science and Technology, National University of Defense Technology, Changsha, China. Since 2012, he has been engaged in information security research at the key confidential information laboratory of Xiamen, China. He is currently an associate professor in the Computer Engineering College of Jimei University, China. His research interests include deep learning, information security and electromagnetic information leakage.

Jiemin Zhang was born in Taiyuan, Shanxi, P.R. China, in 1964. She received her Master's degree in computer science from Fudan University, Shanghai, China, in 1992. She is currently a professor in the Computer Engineering College, Jimei University, Xiamen, China. Her research interests include intelligence science and information security.

Zhi Tang was born in Hezhou, Guangxi, P.R. China, in 1997. He is now a junior majoring in computer science and technology in the College of Computer Engineering at Jimei University, Xiamen, China. He is currently studying and assisting teachers in the key project group of the Fujian Provincial Science and Technology Department.

Zhiling Gu was born in Kunming, Yunnan, P.R. China, in 1998. She is a graduate in network engineering of the School of Computer Engineering, Jimei University, Xiamen, China. She is currently studying and assisting teachers in the key project group of the Fujian Provincial Science and Technology Department.
