SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging

Ye, Yanfang; Chen, Lifei; Wang, Dingding; Li, Tao; Jiang, Qingshan; Zhao, Min

doi:10.1007/s11416-008-0108-y


Original Paper
Published: 26 November 2008

Volume 5, pages 283–293, (2009)

Journal in Computer Virology

Yanfang Ye¹,
Lifei Chen²,
Dingding Wang³,
Tao Li³,
Qingshan Jiang⁴ &
…
Min Zhao⁵


Abstract

Malicious executables are programs designed to infiltrate or damage a computer system without the owner’s consent, and they have become a serious threat to the security of computer systems. There is an urgent need for effective techniques to detect polymorphic, metamorphic and previously unseen malicious executables, which most commercial anti-virus software fails to detect. In this paper, we develop an interpretable string based malware detection system (SBMDS), which is based on interpretable string analysis and uses a support vector machine (SVM) ensemble with bagging to classify file samples and predict the exact types of the malware. Interpretable strings contain both application programming interface (API) execution calls and important semantic strings reflecting an attacker’s intent and goal. SBMDS operates in four major steps: (1) constructing the interpretable strings by developing a feature parser; (2) performing feature selection to select informative strings related to different types of malware; (3) using an SVM ensemble with bagging to construct the classifier; and (4) building the malware detector, which not only detects whether a program is malicious but also predicts the exact type of the malware. Our case study on a large collection of file samples collected by the Kingsoft Anti-virus lab illustrates that: (1) the accuracy and efficiency of SBMDS exceed those of several popular anti-virus software packages; (2) based on the signatures of interpretable strings, SBMDS outperforms data mining based detection systems that employ a single SVM, Naive Bayes with bagging, or decision trees with bagging; (3) compared with IMDS, which uses objective-oriented association (OOA) based classification on API calls, SBMDS achieves better performance. SBMDS has already been incorporated into the scanning tool of a commercial anti-virus software product.
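
The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the parser, the selection criterion, or the SVM settings, so everything below is an assumption made for illustration. Strings are taken to be printable ASCII runs, feature selection uses a simple difference in per-class document frequency, each base learner is a bias-free linear SVM trained with a Pegasos-style sub-gradient method, and only the binary malicious/benign decision is modelled (malware type prediction is omitted). The toy byte blobs and all function names are hypothetical.

```python
import random
import re


def extract_strings(blob, min_len=4):
    """Step 1 (assumed parser): printable ASCII runs of at least min_len bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {m.decode("ascii") for m in re.findall(pattern, blob)}


def select_features(string_sets, labels, top_k):
    """Step 2 (assumed criterion): keep strings whose document frequency
    differs most between the malware (+1) and benign (-1) classes."""
    n_pos = sum(1 for lab in labels if lab == 1)
    n_neg = len(labels) - n_pos
    scores = {}
    for s in set().union(*string_sets):
        df_pos = sum(1 for ss, lab in zip(string_sets, labels) if lab == 1 and s in ss)
        df_neg = sum(1 for ss, lab in zip(string_sets, labels) if lab == -1 and s in ss)
        scores[s] = abs(df_pos / n_pos - df_neg / n_neg)
    # Ties are broken alphabetically so the vocabulary is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]


def vectorize(strings, vocab):
    """Binary presence vector over the selected vocabulary."""
    return [1.0 if s in strings else 0.0 for s in vocab]


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm(X, y, lam=0.01, epochs=20, rng=None):
    """Pegasos-style sub-gradient descent on the regularized hinge loss."""
    rng = rng or random.Random(0)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * dot(w, X[i])
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def train_bagged_svms(X, y, n_models=5, seed=0):
    """Step 3: one linear SVM per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(len(X)) for _ in X]
        if len({y[i] for i in idx}) < 2:
            continue  # redraw bootstrap samples that contain a single class
        models.append(train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng=rng))
    return models


def predict(models, x):
    """Step 4: majority vote of the ensemble; +1 = malicious, -1 = benign."""
    votes = sum(1 if dot(w, x) > 0 else -1 for w in models)
    return 1 if votes > 0 else -1


# Hypothetical toy samples standing in for real executables.
malware = [
    b"\x00\x01CreateRemoteThread\x00WriteProcessMemory\x00kernel32.dll\x00",
    b"CreateRemoteThread\x00\x02RegSetValueExA\x00kernel32.dll",
    b"\x03CreateRemoteThread\x00URLDownloadToFileA\x00kernel32.dll",
]
benign = [
    b"GetOpenFileNameW\x00kernel32.dll\x00MessageBoxW",
    b"GetOpenFileNameW\x00\x01kernel32.dll\x00PrintDlgW",
    b"GetOpenFileNameW\x00kernel32.dll",
]
labels = [1] * len(malware) + [-1] * len(benign)
string_sets = [extract_strings(b) for b in malware + benign]
vocab = select_features(string_sets, labels, top_k=4)
models = train_bagged_svms([vectorize(s, vocab) for s in string_sets], labels)


def classify(blob):
    return predict(models, vectorize(extract_strings(blob), vocab))
```

Because every feature is a human-readable string, the learned weight attached to each entry of `vocab` can be inspected directly, which is the sense in which such a detector is interpretable. A production system would likely replace the toy selection score and the hand-written SVM with a tested library implementation.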



Author information

Authors and Affiliations

Department of Computer Science, Xiamen University, Xiamen, 361005, People’s Republic of China
Yanfang Ye
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, 350108, People’s Republic of China
Lifei Chen
School of Computer Science, Florida International University, Miami, FL, 33199, USA
Dingding Wang & Tao Li
Software School, Xiamen University, Xiamen, 361005, People’s Republic of China
Qingshan Jiang
Anti-virus Laboratory, KingSoft Corporation, Zhuhai, 519000, People’s Republic of China
Min Zhao


Corresponding author

Correspondence to Tao Li.


About this article

Cite this article

Ye, Y., Chen, L., Wang, D. et al. SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging. J Comput Virol 5, 283–293 (2009). https://doi.org/10.1007/s11416-008-0108-y


Received: 01 June 2008
Revised: 23 October 2008
Accepted: 05 November 2008
Published: 26 November 2008
Issue Date: November 2009
DOI: https://doi.org/10.1007/s11416-008-0108-y
