Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis

Hassaoui, Mohamed; Hanini, Mohamed; El Kafhali, Said

doi:10.1007/s10922-023-09793-6

Published: 02 January 2024

Volume 32, article number 18, (2024)

Journal of Network and Systems Management

Abstract

Domain Generation Algorithms (DGAs) are often used to generate large numbers of domain names in order to maintain command and control (C2) communication between an infected computer and the botmaster. By generating a great number of domain names on demand, attackers can mask their C2 servers and escape detection. Many malware families have switched to a stealthier contact approach, so traditional methods have become ineffective. Over the past decades, many researchers have started to use artificial intelligence to build systems able to detect DGAs in traffic, but these works do not use the same data to evaluate their models. This article proposes a methodology for comparing machine learning models based on unsupervised clustering, and then applies this methodology to study the best models among neural network methods and traditional machine learning methods for detecting DGAs. We extracted 21 linguistic features based on alphanumeric and n-gram analysis, and studied the correlation between these features in order to reduce their number. We examine these machine learning algorithms in detail and discuss the drawbacks and strengths of each method on specific classes of DGA, in order to propose a new switch-case model that could always be reliable in detecting DGAs.
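As a rough illustration of the feature pipeline the abstract describes, the sketch below computes a handful of alphanumeric and n-gram statistics for a domain name and then discards features that are strongly Pearson-correlated with one already kept. The six features shown, the function names, and the 0.9 threshold are assumptions made for this example; the abstract does not list the article's 21 features or its correlation cutoff.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def extract_features(domain: str) -> dict:
    """Illustrative alphanumeric and n-gram features (hypothetical set, not the paper's 21)."""
    label = domain.lower().split(".")[0]  # keep only the leftmost label
    n = len(label)
    if n == 0:
        raise ValueError("empty domain label")
    counts = Counter(label)
    # Shannon entropy of the character distribution, in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    bigrams = [label[i:i + 2] for i in range(n - 1)]
    return {
        "length": float(n),
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
        "vowel_ratio": sum(ch in VOWELS for ch in label) / n,
        "unique_ratio": len(counts) / n,
        "entropy": entropy,
        "distinct_bigram_ratio": len(set(bigrams)) / max(len(bigrams), 1),
    }


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def prune_correlated(rows, threshold=0.9):
    """Greedily keep a feature only if |r| < threshold against every feature kept so far."""
    kept = []
    for name in rows[0]:
        col = [r[name] for r in rows]
        if all(abs(pearson(col, [r[k] for r in rows])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Calling `prune_correlated([extract_features(d) for d in domains])` on a list of domain names returns the names of the features that survive the correlation filter. The greedy order-dependent pruning is one simple choice among several; it is not claimed to be the authors' procedure.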

Data Availability

The data that support the findings of this study are openly available in figshare at https://doi.org/10.6084/m9.figshare.21944090.v1.

Code Availability

Not applicable.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments, which have helped us to considerably improve the content, quality, and presentation of this article.

Funding

There is no funding for this research paper.

Author information

Authors and Affiliations

Faculty of Sciences and Techniques, Computer, Networks, Modeling, and Mobility Laboratory (IR2M), Hassan First University of Settat, 26000, Settat, Morocco
Mohamed Hassaoui, Mohamed Hanini & Said El Kafhali

Contributions

These authors contributed equally to this work.

Corresponding author

Correspondence to Said El Kafhali.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Yes, we agree to publish this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hassaoui, M., Hanini, M. & El Kafhali, S. Unsupervised Clustering for a Comparative Methodology of Machine Learning Models to Detect Domain-Generated Algorithms Based on an Alphanumeric Features Analysis. J Netw Syst Manage 32, 18 (2024). https://doi.org/10.1007/s10922-023-09793-6

Received: 19 August 2023
Revised: 20 November 2023
Accepted: 22 November 2023
Published: 02 January 2024
DOI: https://doi.org/10.1007/s10922-023-09793-6
