Abstract
Computers, the Internet, and smartphones have changed our lives as never before. Today, we cannot imagine life without these technologies. Nearly everything around us is connected to and controlled by systems and software, and software and mobile applications have become the nerve center of daily life. Our dependence on these systems is so great that it is frightening to imagine what would happen if they failed. They face a constant threat from many types of cyber-attack, and cybercriminals evolve their attack strategies every day. Attacks that use ever more sophisticated malware are a major concern for all types of users. The cyber world has recently witnessed rapid changes in malware attack strategies, and the volume, velocity, and complexity of malware pose new challenges for malware detection systems. A scalable malware detection system capable of detecting complex attacks is the need of the hour. In this paper, we propose a scalable malware detection system that uses big data and a machine learning approach. The machine learning model in the system is implemented using Apache Spark, which supports distributed learning. Locality-sensitive hashing is used for malware detection, which significantly reduces detection time. A five-stage iterative process was used to carry out the implementation and experimental analysis. The proposed model achieved 99.8% accuracy and significantly reduced learning and malware detection time compared with models proposed by other researchers.
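To illustrate why locality-sensitive hashing can shorten detection time, the following is a minimal sketch, not the paper's implementation. The abstract does not say which LSH scheme is used, so the sketch assumes MinHash with banding over shingles of API call names; the helper names (`shingles`, `minhash_signature`, `LSHIndex`), the parameter values, and the sample call sequences are all hypothetical, and it runs on a single machine rather than on Apache Spark.

```python
import hashlib
from collections import defaultdict


def shingles(tokens, k=2):
    """Return the set of k-token shingles of a sequence (e.g. API call names)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, salted with a seed."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=32):
    """MinHash signature: for each seed, the smallest hash over the set."""
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


class LSHIndex:
    """Banded MinHash index (illustrative): similar sets tend to share a bucket."""

    def __init__(self, num_hashes=32, bands=16):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be divisible by bands")
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)

    def _keys(self, signature):
        # Each band of the signature becomes one bucket key.
        for b in range(self.bands):
            yield (b, tuple(signature[b * self.rows:(b + 1) * self.rows]))

    def add(self, label, items):
        for key in self._keys(minhash_signature(items, self.num_hashes)):
            self.buckets[key].add(label)

    def query(self, items):
        """Return labels sharing at least one band with the query set."""
        found = set()
        for key in self._keys(minhash_signature(items, self.num_hashes)):
            found |= self.buckets.get(key, set())
        return found


# Hypothetical API call sequences, for illustration only.
known = ["LoadLibraryA", "GetProcAddress", "VirtualAlloc",
         "WriteProcessMemory", "CreateRemoteThread", "CloseHandle"]
variant = known[:-1] + ["ExitProcess"]   # near-duplicate of the known sample
unrelated = ["socket", "connect", "send", "recv", "closesocket", "WSACleanup"]

index = LSHIndex()
index.add("known_sample", shingles(known))

exact_hits = index.query(shingles(known))
variant_hits = index.query(shingles(variant))
unrelated_hits = index.query(shingles(unrelated))
print(exact_hits, variant_hits, unrelated_hits)
```

A query inspects only the buckets its own signature falls into, so the number of candidates to compare against stays small even when the index holds many samples. That is the general property that makes LSH attractive for large malware collections. Retrieval of near-duplicates is probabilistic and depends on the number of bands and rows chosen.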
Availability of data and material
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Code availability
Not applicable.
Funding
Not applicable.
Contributions
The paper has a single author, who carried out all of the work reported in it.
Ethics declarations
Conflict of interest
The author hereby declares that they have no conflict of interest. No research grant or fund has been received from any agency to carry out the research work discussed in the manuscript.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Kumar, M. Scalable malware detection system using big data and distributed machine learning approach. Soft Comput 26, 3987–4003 (2022). https://doi.org/10.1007/s00500-021-06492-9
DOI: https://doi.org/10.1007/s00500-021-06492-9