An empirical study to estimate the stability of random forest classifier on the hybrid features recommended by filter based feature selection technique

  • Original Article
  • International Journal of Machine Learning and Cybernetics

Abstract

The emergence of advanced malware is a serious threat to information security. To identify sophisticated malware, a detection technique should consider the runtime behaviour of the source file to detect malicious intent. Although behaviour-based malware detection is a substantial improvement over traditional signature-based detection, current malware employs code obfuscation techniques to elude detection. This paper presents the Hybrid Features-based Malware Detection System (HFMDS), which integrates static and dynamic features of portable executable (PE) files to discern malware. The HFMDS is trained with the prominent features recommended by a filter-based feature selection technique (FST). The detection ability of the proposed HFMDS was evaluated with the random forest (RF) classifier on two different datasets consisting of real-world Windows malware samples. An in-depth analysis is carried out to determine the optimal number of decision trees (DTs) required by the RF classifier to achieve consistent accuracy. In addition, the performance of four popular FSTs is analyzed to determine which FST recommends the best features. From the experimental analysis, we infer that increasing the number of DTs beyond 160 within the RF classifier does not significantly improve detection accuracy.
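
To illustrate what a filter-based FST does, the following minimal Python sketch ranks features by information gain, scoring each feature against the class label independently of any classifier. The abstract does not name the four FSTs that were compared, so the choice of information gain is an assumption; the feature names and the toy samples are hypothetical and are not taken from the paper's datasets.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

def rank_features(rows, labels, names):
    """Score each column without training a classifier and sort best-first."""
    scores = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical binary features: one dynamic (API call seen at runtime)
# and two static (PE header properties). Invented for illustration only.
names = ["calls_write_process_memory", "has_packed_section", "imports_kernel32"]
rows = [
    (1, 1, 1),
    (1, 0, 1),
    (1, 1, 1),
    (1, 0, 1),
    (0, 0, 1),
    (0, 1, 1),
    (0, 0, 1),
    (0, 0, 1),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malware, 0 = benign

ranking = rank_features(rows, labels, names)
top_k = [name for name, _ in ranking[:2]]  # keep the two best-scoring features
```

In a pipeline like the one the abstract describes, only the `top_k` columns would be passed to the classifier. A feature that is constant across all samples, such as `imports_kernel32` here, scores zero and is filtered out.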

Notes

  1. https://code.google.com/p/pefile/, accessed on February 2018.

  2. https://github.com/urwithajit9/ClaMP, accessed on February 2018.

  3. https://github.com/guelfoweb/peframe/, accessed on February 2018.

  4. https://virustotal.com/en/statistics/, accessed on January 2018.

  5. https://virusshare.com/, accessed on January 2018.

  6. https://vx.netlux.org/, accessed on January 2018.

  7. http://download.cnet.com/, accessed on January 2018.

  8. http://www.onlinedown.net/, accessed on January 2018.

  9. https://github.com/rieck/malheur/commit/801410c653e41782d04139972e178288dd09bac1, accessed on January 2018.

Author information

Corresponding author

Correspondence to S. L. Shiva Darshan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Darshan, S.L.S., Jaidhar, C.D. An empirical study to estimate the stability of random forest classifier on the hybrid features recommended by filter based feature selection technique. Int. J. Mach. Learn. & Cyber. 11, 339–358 (2020). https://doi.org/10.1007/s13042-019-00978-7

  • DOI: https://doi.org/10.1007/s13042-019-00978-7
