Unknown malcode detection and the imbalance problem

Moskovitch, Robert; Stopel, Dima; Feher, Clint; Nissim, Nir; Japkowicz, Nathalie; Elovici, Yuval

doi:10.1007/s11416-009-0122-8

Original Paper
Published: 11 July 2009

Volume 5, pages 295–308, (2009)
Journal in Computer Virology

Robert Moskovitch¹,
Dima Stopel¹,
Clint Feher¹,
Nir Nissim¹,
Nathalie Japkowicz² &
…
Yuval Elovici¹

Abstract

The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today's signature-based antivirus tools are very accurate for known malicious code, but cannot detect new malicious code. Recently, classification algorithms have been used successfully to detect unknown malicious code. However, these studies involved test collections of limited size and used the same malicious-to-benign file ratio in both the training and test sets, a situation that does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code that draws on concepts from text categorization and is based on n-gram extraction from the binary code and on feature selection. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the proportion of malicious files is expected to be low, about 10% of the total files. For practical purposes, it is unclear what the corresponding percentage in the training set should be. Our results indicate that greater than 95% accuracy can be achieved with a training set in which malicious files make up less than 33.3% of the files.
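The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the normalization of counts by the most frequent n-gram in a file is an assumed text-categorization convention, and document frequency is used only as a simple stand-in for the feature selection step, which the abstract does not specify. The last helper shows one way to build a training or test set with a chosen percentage of malicious files.

```python
import random
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte sequence in a file's raw bytes."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def term_frequencies(data: bytes, n: int) -> dict:
    """Scale n-gram counts by the most frequent n-gram in the file.

    Assumption: this normalization is borrowed from text categorization
    and is not stated in the abstract.
    """
    counts = byte_ngrams(data, n)
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: c / top for gram, c in counts.items()}


def top_by_document_frequency(files, n, k):
    """Keep the k n-grams that occur in the largest number of files.

    Assumption: document frequency stands in for whatever feature
    selection measure is actually used.
    """
    df = Counter()
    for data in files:
        # A set ensures each n-gram is counted once per file.
        df.update(set(byte_ngrams(data, n)))
    return [gram for gram, _ in df.most_common(k)]


def sample_with_malicious_share(malicious, benign, share, size, seed=0):
    """Draw `size` labelled files of which a fraction `share` is malicious.

    Label 1 marks a malicious file and label 0 a benign one.
    """
    rng = random.Random(seed)
    n_mal = round(size * share)
    picked = [(f, 1) for f in rng.sample(malicious, n_mal)]
    picked += [(f, 0) for f in rng.sample(benign, size - n_mal)]
    rng.shuffle(picked)
    return picked
```

Calling `sample_with_malicious_share` once with `share=0.1` for the test set and repeatedly with different shares for the training set reproduces the kind of comparison the abstract describes, in which the test set keeps a realistic 10% of malicious files while the training percentage varies.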

Author information

Authors and Affiliations

Deutsche Telekom Laboratories, Department of Information Systems Engineering, Ben Gurion University, 84105, Be’er Sheva, Israel
Robert Moskovitch, Dima Stopel, Clint Feher, Nir Nissim & Yuval Elovici
School of Information Technology and Engineering, University of Ottawa, Ottawa, ON, K1N 6N5, Canada
Nathalie Japkowicz

Corresponding author

Correspondence to Robert Moskovitch.

About this article

Cite this article

Moskovitch, R., Stopel, D., Feher, C. et al. Unknown malcode detection and the imbalance problem. J Comput Virol 5, 295–308 (2009). https://doi.org/10.1007/s11416-009-0122-8

Received: 11 September 2008
Accepted: 06 May 2009
Published: 11 July 2009
Issue Date: November 2009
DOI: https://doi.org/10.1007/s11416-009-0122-8
