Scalable malware detection system using big data and distributed machine learning approach

Kumar, Manish

doi:10.1007/s00500-021-06492-9

Application of soft computing
Published: 05 November 2021

Soft Computing, volume 26, pages 3987–4003 (2022)

Manish Kumar ORCID: orcid.org/0000-0001-7862-0195¹

Abstract

Computers, the Internet, and smartphones have changed our lives as never before, and today we can hardly imagine life without them. Almost everything around us is connected to and controlled by systems and software, and software and mobile applications have become the nerve center of daily life. Our dependence on these systems is so great that it is unsettling to imagine one of them failing at any point in time. They are under constant threat from many types of cyber-attack, and cybercriminals evolve their attack strategies every day. Attacks that use ever more sophisticated malware are a major concern for all types of users, and malware attack strategies have changed rapidly in the recent past. The volume, velocity, and complexity of malware pose new challenges for malware detection systems. A scalable malware detection system capable of detecting complex attacks is the need of the hour. In this paper, we propose a scalable malware detection system that uses big data and a machine learning approach. The machine learning model in the system is implemented using Apache Spark, which supports distributed learning. Locality-sensitive hashing is used for malware detection, which significantly reduces detection time. A five-stage iterative process was used to carry out the implementation and experimental analysis. The proposed model achieved 99.8% accuracy and also significantly reduced learning and malware detection time compared with models proposed by other researchers.
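The abstract does not say which locality-sensitive hashing scheme the system uses or which features it hashes, so the following is only an illustrative sketch of the general idea, not the author's implementation. It assumes each sample is described by a set of API-call names and uses MinHash with banding, in plain Python rather than Apache Spark. The names `LSHIndex`, `minhash_signature`, and `band_keys`, the parameter values, and the sample API calls are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Assumed parameters: 32 hash functions split into 8 bands of 4 rows each.
NUM_HASHES = 32
BANDS = 8
ROWS = NUM_HASHES // BANDS


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens):
    """MinHash signature of a non-empty set of tokens (e.g. API-call names)."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


def band_keys(signature):
    """Split a signature into bands; each band is one bucket key."""
    return [(b, signature[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


class LSHIndex:
    """Maps band keys to the names of samples that fall into that bucket."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, name, tokens):
        for key in band_keys(minhash_signature(tokens)):
            self.buckets[key].add(name)

    def candidates(self, tokens):
        """Return indexed samples sharing at least one band with the query."""
        found = set()
        for key in band_keys(minhash_signature(tokens)):
            found |= self.buckets.get(key, set())
        return found


if __name__ == "__main__":
    index = LSHIndex()
    # Hypothetical feature set for a known malicious sample.
    index.add("known_sample", {"CreateFileW", "WriteFile",
                               "RegSetValueExW", "CreateRemoteThread"})
    # A query only needs to be compared against the returned candidates.
    query = {"CreateFileW", "WriteFile", "RegSetValueExW", "CreateRemoteThread"}
    print(index.candidates(query))
```

The speed-up comes from the lookup step: a query is compared only against samples that share a bucket with it, not against every indexed sample. Similar but non-identical sets collide with a probability that depends on their Jaccard similarity and on the band and row settings.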

Availability of data and material

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Code availability

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Department of Master of Computer Applications, M. S. Ramaiah Institute of Technology, Bangalore, India
Manish Kumar

Contributions

The paper has a single author, who carried out all of the work presented in it.

Corresponding author

Correspondence to Manish Kumar.

Ethics declarations

Conflict of interest

The author declares no conflict of interest. No research grant or funding was received from any agency for the research work discussed in the manuscript.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kumar, M. Scalable malware detection system using big data and distributed machine learning approach. Soft Comput 26, 3987–4003 (2022). https://doi.org/10.1007/s00500-021-06492-9

Accepted: 24 October 2021
Published: 05 November 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s00500-021-06492-9
