An empirical study to estimate the stability of random forest classifier on the hybrid features recommended by filter based feature selection technique

Darshan, S. L. Shiva; Jaidhar, C. D.

doi:10.1007/s13042-019-00978-7

Original Article
Published: 02 July 2019

Volume 11, pages 339–358, (2020)

International Journal of Machine Learning and Cybernetics

Abstract

The emergence of advanced malware is a serious threat to information security. An effective technique for identifying sophisticated malware should consider the runtime behaviour of the source file to detect malicious intent. Although behaviour-based malware detection is a substantial improvement over traditional signature-based detection, current malware employs code obfuscation techniques to elude detection. This paper presents the Hybrid Features-based Malware Detection System (HFMDS), which integrates static and dynamic features of portable executable (PE) files to discern malware. The HFMDS is trained with the prominent features recommended by a filter-based feature selection technique (FST). The detection ability of the proposed HFMDS was evaluated with the random forest (RF) classifier on two different datasets that consist of real-world Windows malware samples. In-depth analysis is carried out to determine the optimal number of decision trees (DTs) required by the RF classifier to achieve consistent accuracy. In addition, the performance of four popular FSTs is analyzed to determine which FST recommends the best features. From the experimental analysis, we infer that increasing the number of DTs beyond 160 within the RF classifier does not yield a significant improvement in detection accuracy.
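
The two ideas in the abstract, scoring features with a filter and looking for the forest size at which accuracy levels off, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the four FSTs, so information gain stands in as one common filter score; the feature names, the toy samples, and the accuracy values are invented and are not results from the paper; and `rank_features` and `plateau_point` are hypothetical helper names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Filter-style ranking: each feature is scored without training a classifier."""
    scores = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def plateau_point(tree_counts, accuracies, tolerance):
    """Smallest tree count after which no larger forest gains more than `tolerance`."""
    if not tree_counts or len(tree_counts) != len(accuracies):
        raise ValueError("tree_counts and accuracies must be non-empty and equal in length")
    for i, count in enumerate(tree_counts):
        if all(later - accuracies[i] <= tolerance for later in accuracies[i + 1:]):
            return count


# Toy hybrid feature vectors (hypothetical names): one dynamic and two static flags.
names = ["calls_CreateRemoteThread", "has_debug_section", "imports_ws2_32"]
rows = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
labels = ["malware", "malware", "benign", "benign"]
ranking = rank_features(rows, labels, names)

# Invented accuracy curve, used only to exercise plateau_point.
best_size = plateau_point([10, 20, 40, 80], [0.80, 0.90, 0.91, 0.9105], tolerance=0.005)
```

In a real experiment the accuracy list would come from cross-validated runs of an RF classifier at each forest size, and the tolerance would be chosen to match what counts as a significant difference for the study.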

Notes

https://code.google.com/p/pefile/, accessed on February 2018.
https://github.com/urwithajit9/ClaMP, accessed on February 2018.
https://github.com/guelfoweb/peframe/, accessed on February 2018.
https://virustotal.com/en/statistics/, accessed on January 2018.
https://virusshare.com/, accessed on January 2018.
https://vx.netlux.org/, accessed on January 2018.
http://download.cnet.com/, accessed on January 2018.
http://www.onlinedown.net/, accessed on January 2018.
https://github.com/rieck/malheur/commit/801410c653e41782d04139972e178288dd09bac1, accessed on January 2018.

Author information

Authors and Affiliations

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
S. L. Shiva Darshan & C. D. Jaidhar

Corresponding author

Correspondence to S. L. Shiva Darshan.

About this article

Cite this article

Darshan, S.L.S., Jaidhar, C.D. An empirical study to estimate the stability of random forest classifier on the hybrid features recommended by filter based feature selection technique. Int. J. Mach. Learn. & Cyber. 11, 339–358 (2020). https://doi.org/10.1007/s13042-019-00978-7

Received: 04 May 2018
Accepted: 25 June 2019
Published: 02 July 2019
Issue Date: February 2020
DOI: https://doi.org/10.1007/s13042-019-00978-7
