Abstract
The emergence of advanced malware is a serious threat to information security. An effective technique for identifying sophisticated malware should consider the runtime behaviour of the source file to detect malicious intent. Although behaviour-based malware detection is a substantial improvement over traditional signature-based detection, current malware employs code obfuscation techniques to elude detection. This paper presents the Hybrid Features-based Malware Detection System (HFMDS), which integrates static and dynamic features of portable executable (PE) files to discern malware. The HFMDS is trained with the prominent features recommended by a filter-based feature selection technique (FST). The detection ability of the proposed HFMDS was evaluated with the random forest (RF) classifier on two different datasets that consist of real-world Windows malware samples. An in-depth analysis is carried out to determine the optimal number of decision trees (DTs) required by the RF classifier to achieve consistent accuracy. In addition, the performance of four popular FSTs is analyzed to determine which FST recommends the best features. From the experimental analysis, we infer that increasing the number of DTs in the RF classifier beyond 160 does not significantly improve detection accuracy.
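To make the idea of filter-based feature selection concrete, the following is a minimal sketch that ranks discrete features by information gain and keeps the top-scoring ones before any classifier is trained. Information gain is used here only as one common example of a filter criterion; the abstract does not name the four FSTs that were compared. The function names, the feature names, and the toy samples are all hypothetical and are not taken from the paper's datasets.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after grouping samples by a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(samples, labels, top_k):
    """Return the names of the top_k features, scored independently of any classifier."""
    names = list(samples[0].keys())
    scores = {
        name: information_gain([s[name] for s in samples], labels)
        for name in names
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: two static-style flags and one dynamic-style flag per file.
samples = [
    {"imports_CreateRemoteThread": 1, "high_entropy_section": 1, "has_version_info": 0},
    {"imports_CreateRemoteThread": 1, "high_entropy_section": 0, "has_version_info": 1},
    {"imports_CreateRemoteThread": 0, "high_entropy_section": 1, "has_version_info": 1},
    {"imports_CreateRemoteThread": 0, "high_entropy_section": 0, "has_version_info": 0},
]
labels = ["malware", "malware", "benign", "benign"]

selected = rank_features(samples, labels, top_k=2)
print(selected)
```

In a pipeline of the kind the abstract describes, the selected feature names would then define the columns passed to the random forest, and the number of trees would be varied separately to observe where accuracy stabilizes.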
Notes
https://code.google.com/p/pefile/, accessed on February 2018.
https://github.com/urwithajit9/ClaMP, accessed on February 2018.
https://github.com/guelfoweb/peframe/, accessed on February 2018.
https://virustotal.com/en/statistics/, accessed on January 2018; see also Amin (2016).
https://virusshare.com/, accessed on January 2018.
https://vx.netlux.org/, accessed on January 2018.
http://download.cnet.com/, accessed on January 2018.
http://www.onlinedown.net/, accessed on January 2018.
https://github.com/rieck/malheur/commit/801410c653e41782d04139972e178288dd09bac1, accessed on January 2018.
References
(2014) Cuckoo sandbox—automated malware analysis. https://cuckoosandbox.org/. Accessed Jan 2018
Ahmadi M, Sami A, Rahimi H, Yadegari B (2013) Malware detection by behavioural sequential patterns. Comput Fraud Secur 8:11–19
Alam MS, Vuong ST (2013) Random forest classification for detecting android malware. In: 2013 IEEE International Conference on green computing and communications and IEEE internet of things and IEEE cyber, physical and social computing. IEEE, pp 663–669. https://doi.org/10.1109/GreenCom-iThings-CPSCom.2013.122
Aman N, Saleem Y, Abbasi FH, Shahzad F (2017) A hybrid approach for malware family classification. In: Applications and techniques in information security, vol 719. Springer, Singapore, pp 169–180. https://doi.org/10.1007/978-981-10-5421-1_14
Amin M (2016) A survey of financial losses due to malware. In: Proceedings of the second international conference on information and communication technology for competitive Strategies. ACM, pp 145:1–145:4. https://doi.org/10.1145/2905055.2905362
Awan S, Saqib NA (2016) Detection of malicious executables using static and dynamic features of portable executable (PE) file. In: Security, privacy and anonymity in computation, communication and storage, vol 10067. Springer, Cham, pp 48–58. https://doi.org/10.1007/978-3-319-49145-5_6
Bai J, Wang J, Zou G (2014) A malware detection scheme based on mining format information. Sci World J 2014:1–11. https://doi.org/10.1155/2014/260905
Baldangombo U, Jambaljav N, Horng SJ (2013) A static malware detection system using data mining methods. CoRR. arXiv:1308.2831
Bayer U, Moser A, Kruegel C, Kirda E (2006) Dynamic analysis of malicious code. J Comput Virol 2(1):67–77
Belaoued M, Mazouzi S (2015) A real-time PE-malware detection system based on chi-square test and PE-file features. In: Computer science and its applications, vol 456. Springer, Cham, pp 416–425. https://doi.org/10.1007/978-3-319-19578-0_34
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cutler A, Cutler DR, Stevens JR (2012) Random forests. In: Ensemble machine learning: methods and applications. Springer, Boston, MA, pp 157–175. https://doi.org/10.1007/978-1-4419-9326-7_5
Damodaran A, Di Troia F, Visaggio CA, Austin TH, Stamp M (2017) A comparison of static, dynamic, and hybrid analysis for malware detection. J Comput Virol Hacking Tech 13(1):1–12
Darshan SLS, Kumara MAA, Jaidhar CD (2016) Windows malware detection based on cuckoo sandbox generated report using machine learning algorithm. In: 2016 11th international conference on industrial and information systems (ICIIS). IEEE, pp 534–539. https://doi.org/10.1109/ICIINFS.2016.8262998
Das S, Liu Y, Zhang W, Chandramohan M (2016) Semantics-based online malware detection: towards efficient real-time protection against malware. IEEE Trans Inform For Secur 11(2):289–302
David B, Filiol E, Gallienne K (2017) Structural analysis of binary executable headers for malware detection optimization. J Comput Virol Hacking Tech 13(2):87–93
David OE, Netanyahu NS (2015) DeepSign: deep learning for automatic malware signature generation and classification. In: 2015 International joint conference on neural networks (IJCNN). IEEE, pp 1–8. https://doi.org/10.1109/IJCNN.2015.7280815
Ding Y, Dai W, Yan S, Zhang Y (2014) Control flow-based opcode behavior analysis for malware detection. Comput Secur 44:65–74
Faruki P, Laxmi V, Gaur MS, Vinod P (2012) Behavioural detection with API call-grams to identify malicious PE files. In: Proceedings of the first international conference on security of internet of things. ACM, pp 85–91. https://doi.org/10.1145/2490428.2490440
Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B, Witten IH, Trigg L (2010) Weka-a machine learning workbench for data mining. In: Data mining and knowledge discovery handbook. Springer, Boston, MA, pp 1269–1277. https://doi.org/10.1007/978-0-387-09823-4_66
Gulgezen G, Cataltepe Z, Yu L (2009) Stable and accurate feature selection. In: Machine learning and knowledge discovery in databases, vol 5781. Springer, Berlin, Heidelberg, pp 455–468. https://doi.org/10.1007/978-3-642-04180-8_47
Huda S, Abawajy J, Alazab M, Abdollahian M, Islam R, Yearwood J (2016) Hybrids of support vector machine wrapper and filter based framework for malware detection. Future Gener Comput Syst 55:376–390
Huda S, Islam R, Abawajy J, Yearwood J, Hassan MM, Fortino G (2018) A hybrid-multi filter-wrapper framework to identify run-time behaviour for fast malware detection. Future Gener Comput Syst 83:193–207
Islam R, Tian R, Batten LM, Versteeg S (2013) Classification of malware based on integrated static and dynamic features. J Netw Comput Appl 36(2):646–656
Jain A, Singh AK (2017) Integrated malware analysis using machine learning. In: 2017 2nd International conference on telecommunication and networks (TEL-NET). IEEE, pp 1–8. https://doi.org/10.1109/TEL-NET.2017.8343554
Kawaguchi N, Omote K (2015) Malware function classification using APIs in initial behavior. In: 2015 10th Asia joint conference on information security. IEEE, pp 138–144. https://doi.org/10.1109/AsiaJCIS.2015.15
Khoshgoftaar TM, Golawala M, Hulse JV (2007) An empirical study of learning from imbalanced data using random forest. In: 19th IEEE international conference on tools with artificial intelligence (ICTAI 2007), vol 2. IEEE, pp 310–317. https://doi.org/10.1109/ICTAI.2007.46
Kolter JZ, Maloof MA (2006) Learning to detect and classify malicious executables in the wild. J Mach Learn Res 7:2721–2744
Kumar A, Kuppusamy K, Aghila G (2019) A learning model to detect maliciousness of portable executable using integrated feature set. J King Saud Univ Comput Inform Sci 31:252–265
Kumara MA, Jaidhar C (2017) Leveraging virtual machine introspection with memory forensics to detect and characterize unknown malware using machine learning techniques at hypervisor. Digit Investig 23:99–123
Lee J, Im C, Jeong H (2011) A study of malware detection and classification by comparing extracted strings. In: Proceedings of the 5th international conference on ubiquitous information management and communication. ACM, pp 75:1–75:4. http://doi.org/10.1145/1968613.1968704
Leistner C, Saffari A, Santner J, Bischof H (2009) Semi-supervised random forests. In: 2009 IEEE 12th international conference on computer vision. IEEE, pp 506–513. https://doi.org/10.1109/ICCV.2009.5459198
Lengyel TK, Maresca S, Payne BD, Webster GD, Vogl S, Kiayias A (2014) Scalability, fidelity and stealth in the DRAKVUF dynamic malware analysis system. In: Proceedings of the 30th annual computer security applications conference. ACM, pp 386–395. http://doi.org/10.1145/2664243.2664252
Lin CH, Pao HK, Liao JW (2018) Efficient dynamic malware analysis using virtual time control mechanics. Comput Secur 73:359–373
Masud MM, Khan L, Thuraisingham B (2008) A scalable multi-level feature extraction technique to detect malicious executables. Inform Syst Front 10(1):33–45
Miller C, Glendowne D, Cook H, Thomas D, Lanclos C, Pape P (2017) Insights gained from constructing a large scale dynamic analysis platform. Digit Investig 22:S48–S56
Mohaisen A, Alrawi O, Mohaisen M (2015) Amal: High-fidelity, behavior-based automated malware analysis and classification. Comput Secur 52:251–266
Moser A, Kruegel C, Kirda E (2007) Limits of static analysis for malware detection. In: Twenty-third annual computer security applications conference (ACSAC 2007). IEEE, pp 421–430. http://doi.org/10.1109/ACSAC.2007.21
Moskovitch R, Elovici Y, Rokach L (2008) Detection of unknown computer worms based on behavioral classification of the host. Comput Statist Data Anal 52(9):4544–4566
Narouei M, Ahmadi M, Giacinto G, Takabi H, Sami A (2015) Dllminer: structural mining for malware detection. Secur Commun Netw 8(18):3311–3322
Oshiro TM, Perez PS, Baranauskas JA (2012) How many trees in a random forest? In: Machine learning and data mining in pattern recognition, vol 7376. Springer, Berlin, Heidelberg, pp 154–168. https://doi.org/10.1007/978-3-642-31537-4_13
Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Qiao Y, Yang Y, He J, Tang C, Liu Z (2014) CBM: free, automatic malware analysis framework using API call sequences. In: Knowledge engineering and management, vol 214. Springer, Berlin, Heidelberg, pp 225–236. https://doi.org/10.1007/978-3-642-37832-4_21
Raff E, Zak R, Cox R, Sylvester J, Yacci P, Ward R, Tracy A, McLean M, Nicholas C (2016) An investigation of byte n-gram features for malware classification. J Comput Virol Hacking Tech 14(1):1–20
Reddy DKS, Pujari AK (2006) N-gram analysis for computer virus detection. J Comput Virol 2(3):231–239
Rieck K, Trinius P, Willems C, Holz T (2011) Automatic analysis of malware behavior using machine learning. J Comput Secur 19(4):639–668
Sakar CO, Kursun O, Gurgen F (2012) A feature selection method based on kernel canonical correlation analysis and the minimum redundancy-maximum relevance filter method. Expert Syst Appl 39(3):3432–3437
Salehi Z, Sami A, Ghiasi M (2014) Using feature generation from API calls for malware detection. Comput Fraud Secur 9:9–18
Salehi Z, Sami A, Ghiasi M (2017) Maar: Robust features to detect malicious activity based on api calls, their arguments and return values. Eng Appl Artif Intell 59:93–102
Santos I, Brezo F, Ugarte-Pedrero X, Bringas PG (2013) Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inform Sci 231:64–82
Schultz MG, Eskin E, Zadok E, Stolfo SJ (2001) Data mining methods for detection of new malicious executables. In: Proceedings 2001 IEEE symposium on security and privacy. S&P 2001. IEEE, pp 38–49. https://doi.org/10.1109/SECPRI.2001.924286
Sethi K, Chaudhary SK, Tripathy BK, Bera P (2018) A novel malware analysis framework for malware detection and classification using machine learning approach. In: Proceedings of the 19th international conference on distributed computing and networking. ACM, pp 49:1–49:4. http://doi.org/10.1145/3154273.3154326
Shabtai A, Moskovitch R, Elovici Y, Glezer C (2009) Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey. Inform Secur Tech Rep 14(1):16–29
Shijo P, Salim A (2015) Integrated static and dynamic analysis for malware detection. Proc Comput Sci 46:804–811
Tsyganok K, Tumoyan E, Babenko L, Anikeev M (2012) Classification of polymorphic and metamorphic malware samples based on their behavior. In: Proceedings of the fifth international conference on security of information and networks. ACM, pp 111–116. http://doi.org/10.1145/2388576.2388591
Vinod P, Laxmi V, Gaur MS (2011) Scattered feature space for malware analysis. In: Advances in computing and communications, vol 190. Springer, Berlin, Heidelberg, pp 562–571. https://doi.org/10.1007/978-3-642-22709-7_55
Vyas R, Luo X, McFarland N, Justice C (2017) Investigation of malicious portable executable file detection on the network using supervised learning techniques. In: 2017 IFIP/IEEE symposium on integrated network and service management (IM). IEEE, pp 941–946. https://doi.org/10.23919/INM.2017.7987416
Willems C, Holz T, Freiling F (2007) Toward automated dynamic malware analysis using cwsandbox. IEEE Secur Priv 5(2):32–39
Yan G, Brown N, Kong D (2013) Exploring discriminatory features for automated malware classification. In: International conference on detection of intrusions and malware, and vulnerability assessment. Springer, New York, pp 41–61
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
Ye Y, Wang D, Li T, Ye D, Jiang Q (2008) An intelligent pe-malware detection system based on association mining. J Comput Virol 4(4):323–334
Ye Y, Chen L, Wang D, Li T, Jiang Q, Zhao M (2009) Sbmds: an interpretable string based malware detection system using svm ensemble with bagging. J Comput Virol 5(4):283–293
Cite this article
Darshan, S.L.S., Jaidhar, C.D. An empirical study to estimate the stability of random forest classifier on the hybrid features recommended by filter based feature selection technique. Int. J. Mach. Learn. & Cyber. 11, 339–358 (2020). https://doi.org/10.1007/s13042-019-00978-7
DOI: https://doi.org/10.1007/s13042-019-00978-7