Skip to main content
Log in

Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis

  • Published:
Journal of Network and Systems Management Aims and scope Submit manuscript

Abstract

Domain Generation Algorithms (DGAs) are often used for generating huge amounts of domain names to maintain command and control between the infected computer and the bot master. By establishing as needed a great number of domain names, attackers may mask their C2 servers and escape detection. Many malware families have switched to a stealthier contact approach. Therefore, the traditional methods become ineffective. Over the past decades, many researches have started to use artificial intelligence to create systems able to detect DGA in traffic, but these works do not use the same data to evaluate their models. This article proposes a comparative methodology to compare machine learning models based on unsupervised clustering and then applied this methodology to study the best models belonging to neural network methods and traditional machine learning methods to detect DGAs. We extracted 21 linguistic features based on the analysis of alphanumeric and n-gram, we studied the correlation between these features in order to reduce their number. We examine in detail those Machine learning algorithms and we discuss the drawbacks and strengths of each method with specific classes of DGA to propose a new switch case model that could be always reliable to detect DGAs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26
Fig. 27

Similar content being viewed by others

Data Availability

The data that support the findings of this study are openly available in figshare at https://doi.org/10.6084/m9.figshare.21944090.v1.

Code Availability

Not applicable.

References

  1. August, T., Dao, D., Niculescu, M.F.: Economics of ransomware: risk interdependence and large-scale attacks. Manage. Sci. 68(12), 8979–9002 (2022)

    Article  Google Scholar 

  2. Pattnaik, N., Nurse, J.R., Turner, S., Mott, G., MacColl, J., Huesch, P., Sullivan, J.: It’s more than just money: the real-world harms from ransomware attacks. In: International Symposium on Human Aspects of Information Security and Assurance, pp. 261–274 (2023). Springer

  3. Plohmann, D., Yakdan, K., Klatt, M., Bader, J., Gerhards-Padilla, E.: A comprehensive measurement study of domain generating malware. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 263–278 (2016)

  4. Tuan, T.A., Long, H.V., Taniar, D.: On detecting and classifying dga botnets and their families. Comput. Secur. 113, 102549 (2022)

    Article  Google Scholar 

  5. Putra, M.A.R., Ahmad, T., Hostiadi, D.P.: Analysis of botnet attack communication pattern behavior on computer networks. Int. J. Intell. Eng. Syst. 15(4) (2022)

  6. Saeed, A.M., Wang, D., Alnedhari, H.A., Mei, K., Wang, J.: A survey of machine learning and deep learning based dga detection techniques. In: International Conference on Smart Computing and Communication, pp. 133–143 (2021). Springer

  7. Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P.S., Sun, L.: A comprehensive survey of ai-generated content (aigc): a history of generative ai from gan to chatgpt. arXiv preprint arXiv:2303.04226 (2023)

  8. Hoang, X.D., Vu, X.H.: An improved model for detecting dga botnets using random forest algorithm. Information Security Journal: a Global Perspective, 1–10 (2021)

  9. ZiCheng: Predicting domain generation algorithms with n-gram models. In: 2022 International Conference on Big Data, Information and Computer Network (BDICN), pp. 31–38 (2022). IEEE

  10. Hassaoui, M., Hanini, M., El Kafhali, S.: A comparative study of neural networks algorithms in cyber-security to detect domain generation algorithms based on mixed classes of data. In: International Conference on Advanced Intelligent Systems for Sustainable Development, pp. 240–250 (2022). Springer

  11. Zhou, S., Lin, L., Yuan, J., Wang, F., Ling, Z., Cui, J.: Cnn-based dga detection with high coverage. In: 2019 IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 62–67 (2019). IEEE

  12. Woodbridge, J., Anderson, H.S., Ahuja, A., Grant, D.: Predicting domain generation algorithms with long short-term memory networks. arXiv preprint arXiv:1611.00791 (2016)

  13. Vij, P., Nikam, S., Bhatia, A.: Detection of algorithmically generated domain names using lstm. In: 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 1–6 (2020). IEEE

  14. Park, K.H., Song, H.M., Do Yoo, J., Hong, S.-Y., Cho, B., Kim, K., Kim, H.K.: Unsupervised malicious domain detection with less labeling effort. Comput. Secur. 116, 102662 (2022)

    Article  Google Scholar 

  15. Leder, F., Werner, T.: Know your enemy: Containing conficker. The Honeynet Project (2009)

  16. Kamil, S., Norul, H.S.A.S., Firdaus, A., Usman, O.L.: The rise of ransomware: A review of attacks, detection techniques, and future challenges. In: 2022 International Conference on Business Analytics for Technology and Security (ICBATS), pp. 1–7 (2022). IEEE

  17. Wolf, J.: Technical details of Srizbi’s domain generation algorithm (2008)

  18. Stone-Gross, B., Cova, M., Cavallaro, L., Gilbert, B., Szydlowski, M., Kemmerer, R., Kruegel, C., Vigna, G.: Your botnet is my botnet: analysis of a botnet takeover. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, pp. 635–647 (2009)

  19. Leder, F.S., Martini, P.: Ngbpa next generation botnet protocol analysis. In: IFIP International Information Security Conference, pp. 307–317 (2009). Springer

  20. Porras, P., Saidi, H., Yegneswaran, V.: An analysis of conficker’s logic and rendezvous points. Technical report, Technical report, SRI International (2009)

    Google Scholar 

  21. MacQueen, J.: Classification and analysis of multivariate observations. In: 5th Berkeley Symp. Math. Statist. Probability, pp. 281–297 (1967)

  22. Syukra, I., Hidayat, A., Fauzi, M.Z.: Implementation of k-medoids and fp-growth algorithms for grouping and product offering recommendations. Indonesian J. Artif. Intell. Data Min. 2(2), 107–115 (2019)

    Article  Google Scholar 

  23. Popat, S.K., Emmanuel, M.: Review and comparative study of clustering techniques. Int. J. Comput. Sci. Inform. Technol. 5(1), 805–812 (2014)

    Google Scholar 

  24. Singrodia, V., Mitra, A., Paul, S.: A review on web scrapping and its applications. In: 2019 International Conference on Computer Communication and Informatics (ICCCI), pp. 1–6 (2019). IEEE

  25. Alexa: alexa. https://www.alexa.com/. Accessed 08 May 2023

  26. statvoo: statvoo. https://statvoo.com. Accessed 08 May 2023

  27. Cisco: Cisco. https://umbrella.cisco.com/. Accessed 08 May 2023

  28. Bambenek: bambenek. https://osint.bambenekconsulting.com/feeds/dga-feed.txt. Accessed 08 May 2023

  29. DGArchive: DGArchive. https://dgarchive.caad.fkie.fraunhofer.de/. Accessed 08 May 2023

  30. Bader: bader. Accessed: 2023-05-08 (2023). https://github.com/baderj/domain-generation-algorithm

  31. Suen, C.Y.: N-gram statistics for natural language understanding and text processing. IEEE Trans. Pattern Anal. Mach. Intell. 2, 164–172 (1979)

    Article  Google Scholar 

  32. Pang, Y., Xue, X., Namin, A.S.: Predicting vulnerable software components through n-gram analysis and statistical feature selection. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 543–548 (2015). IEEE

  33. Korkmaz, M., Kocyigit, E., Sahingoz, O.K., Diri, B.: Phishing web page detection using n-gram features extracted from urls. In: 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pp. 1–6 (2021). IEEE

  34. Stabili, D., Ferretti, L., Andreolini, M., Marchetti, M.: Daga: Detecting attacks to in-vehicle networks via n-gram analysis. IEEE Trans. Veh. Technol. 71(11), 11540–11554 (2022)

    Article  Google Scholar 

  35. Selvi, J., Rodríguez, R.J., Soria-Olivas, E.: Detection of algorithmically generated malicious domain names using masked n-grams. Expert Syst. Appl. 124, 156–163 (2019)

    Article  Google Scholar 

  36. Schenatto, K., De Souza, E.G., Bazzi, C.L., Gavioli, A., Betzek, N.M., Beneduzzi, H.M.: Normalization of data for delineating management zones. Comput. Electron. Agric. 143, 238–248 (2017)

    Article  Google Scholar 

  37. Cohen, I., Huang, Y., Chen, J., Benesty, J., Benesty, J., Chen, J., Huang, Y., Cohen, I.: Pearson correlation coefficient. Noise reduction in speech processing, 1–4 (2009)

  38. Nagulapati, V.M., Lee, H., Jung, D., Brigljevic, B., Choi, Y., Lim, H.: Capacity estimation of batteries: Influence of training dataset size and diversity on data driven prognostic models. Reliab. Eng. Syst. Saf. 216, 108048 (2021)

    Article  Google Scholar 

  39. Nguyen, Q.H., Ly, H.-B., Ho, L.S., Al-Ansari, N., Le, H.V., Tran, V.Q., Prakash, I., Pham, B.T.: Influence of data splitting on performance of machine learning models in prediction of shear strength of soil. Math. Probl. Eng. 2021 (2021)

  40. Tharani, S., Yamini, C.: Classification using convolutional neural network for heart and diabetics datasets. Int. J. Adv. Res. Comput. Commun. Eng. 5(12), 417–22 (2016)

    Article  Google Scholar 

  41. Berman, D.S.: Dga capsnet: 1d application of capsule networks to dga detection. Information 10(5), 157 (2019)

    Article  Google Scholar 

  42. McKinney, W.: Pandas, python data analysis library. https://pandas.pydata.org/. Accessed 08 May 2023

  43. Lux, M., Bertini, M.: Open source column: deep learning with keras. ACM SIGMultimed. Rec. 10(4), 7–7 (2019)

    Article  Google Scholar 

  44. Varoquaux, G., Buitinck, L., Louppe, G., Grisel, O., Pedregosa, F., Mueller, A.: Scikit-learn: Machine learning without learning the machinery. GetMobile: Mobile Comput. Commun. 19(1), 29–33 (2015)

  45. Pang, B., Nijkamp, E., Wu, Y.N.: Deep learning with tensorflow: a review. J. Educ. Behav. Stat. 45(2), 227–248 (2020)

    Article  Google Scholar 

  46. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: Mllib: machine learning in apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)

    MathSciNet  Google Scholar 

  47. Hassaoui, M., Hanini, M., El Kafhali, S.: Domain generated algorithms detection applying a combination of a deep feature selection and traditional machine learning models. J. Comput. Secur. 31(1), 85–105 (2023)

    Article  Google Scholar 

Download references

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments, which have helped us to considerably improve the content, quality, and presentation of this article.

Funding

There is no funding for this research paper.

Author information

Authors and Affiliations

Authors

Contributions

These authors contributed equally to this work.

Corresponding author

Correspondence to Said El Kafhali.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Yes, we agree to publish this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hassaoui, M., Hanini, M. & El Kafhali, S. Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis. J Netw Syst Manage 32, 18 (2024). https://doi.org/10.1007/s10922-023-09793-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10922-023-09793-6

Keywords

Navigation