Abstract
Domain Generation Algorithms (DGAs) are often used to generate large numbers of domain names that maintain command and control (C2) communication between an infected computer and the bot master. By registering many domain names as needed, attackers can mask their C2 servers and evade detection. Many malware families have switched to this stealthier contact approach, which renders traditional detection methods ineffective. Over the past decades, many researchers have used artificial intelligence to build systems that detect DGAs in network traffic, but these works do not evaluate their models on the same data. This article proposes a methodology, based on unsupervised clustering, for comparing machine learning models, and then applies it to study the best-performing neural network and traditional machine learning models for DGA detection. We extracted 21 linguistic features based on alphanumeric and n-gram analysis and studied the correlation between these features in order to reduce their number. We examine these machine learning algorithms in detail and discuss the strengths and drawbacks of each method on specific classes of DGA, in order to propose a new switch-case model that could remain reliable for detecting DGAs.
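The feature extraction and correlation-based reduction described in the abstract can be illustrated with a minimal sketch. It does not reproduce the 21 features of the article: the five features below, the function names (`extract_features`, `pearson`, `drop_correlated`), the sample domains, and the 0.9 threshold are all assumptions chosen for illustration only.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def extract_features(domain):
    """Return a few illustrative alphanumeric and n-gram features
    for the leftmost label of a domain (hypothetical feature set)."""
    label = domain.lower().split(".")[0]
    n = len(label)
    if n == 0:
        raise ValueError("empty domain label")
    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    bigrams = [label[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
        "vowel_ratio": sum(ch in VOWELS for ch in label) / n,
        "entropy": entropy,
        "distinct_bigram_ratio": len(set(bigrams)) / max(len(bigrams), 1),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def drop_correlated(rows, threshold=0.9):
    """Greedily keep a feature only if its |r| with every feature
    kept so far does not exceed the (assumed) threshold."""
    kept = []
    for name in rows[0]:
        col = [r[name] for r in rows]
        if all(abs(pearson(col, [r[k] for r in rows])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Made-up sample domains, for illustration only
domains = ["google.com", "wikipedia.org", "xj4k9q2vbz7.net", "q8w3e1r7t5y2.info"]
rows = [extract_features(d) for d in domains]
selected = drop_correlated(rows)
```

The greedy pass depends on feature order, so a different ordering may retain a different subset; the reduced feature vectors in `selected` would then feed whichever clustering or classification model is being compared.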
Data Availability
The data that support the findings of this study are openly available in figshare at https://doi.org/10.6084/m9.figshare.21944090.v1.
Code Availability
Not applicable.
Acknowledgements
The authors thank the anonymous reviewers for their valuable comments, which have helped us to considerably improve the content, quality, and presentation of this article.
Funding
There is no funding for this research paper.
Author information
Contributions
These authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical Approval
Not applicable.
Consent to Participate
Not applicable.
Consent for Publication
Yes, we agree to publish this research.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hassaoui, M., Hanini, M. & El Kafhali, S. Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis. J Netw Syst Manage 32, 18 (2024). https://doi.org/10.1007/s10922-023-09793-6