Abstract
Most commercial antivirus software fails to detect new and unknown malicious code. Generic virus detection is a viable way to address this problem. A generic virus detector needs features that are common to viruses. Recently, Kolter et al. [16] proposed an efficient generic virus detector that uses n-grams as features. The fixed-length n-grams used there have the drawback that they cannot capture meaningful sequences of different lengths. In this paper we propose a new method for extracting variable-length n-grams based on the concept of episodes, and we demonstrate that they outperform fixed-length n-grams in malicious code detection. The proposed algorithm requires only two scans over the whole data set, whereas most classical algorithms require a number of scans proportional to the maximum n-gram length.
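To make the fixed-length limitation concrete, the following is a minimal Python sketch of sliding-window byte n-gram counting. It is not the episode-based algorithm proposed in the paper, and it is not taken from Kolter et al.; the function name `fixed_length_ngrams` and the sample byte string are assumptions chosen only for illustration.

```python
from collections import Counter


def fixed_length_ngrams(data: bytes, n: int) -> Counter:
    """Count every contiguous n-byte window in `data` (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of exactly n bytes across the input, one byte at a time.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


# Illustrative input: a 4-byte sequence that occurs twice, followed by one byte.
sample = b"\x4d\x5a\x90\x00\x4d\x5a\x90\x00\x03"

bigrams = fixed_length_ngrams(sample, 2)
fourgrams = fixed_length_ngrams(sample, 4)

# With n = 2 the repeated 4-byte sequence is only visible as separate fragments.
print(bigrams.most_common(3))
# With n = 4 it appears as a single feature, but shorter patterns are lost.
print(fourgrams[b"\x4d\x5a\x90\x00"])
```

Whichever single value of n is chosen, patterns of other lengths are either fragmented or buried inside longer windows, which is the drawback that motivates variable-length n-grams.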
References
Anagnostakis, K.G., Sidiroglou, S., Akritidis, P., Xinidis, K., Markatos, E., Keromytis, A.D.: Detecting targeted attacks using shadow honeypots. In: Proceedings of the 14th USENIX Security Symposium (2005)
Arnold, W., Tesauro, G.: Automatically generated Win32 heuristic virus detection. In: Proceedings of the 2000 International Virus Bulletin Conference (2000)
Assaleh, T.A., Cercone, N., Keselj, V., Sweidan, R.: Detection of new malicious code using N-grams signatures. In: Proceedings of the Second Annual Conference on Privacy, Security and Trust, pp. 193–196 (2004)
Balzer, R., Goldman, N.: Mediating Connectors. In: Proceedings of the 19th IEEE International Conference on Distributed Computing Systems Workshop, Austin, TX, pp. 73–77 (1999)
Cavnar, W., Trenkle, J.: N-gram based text categorization. In: Proceedings of SDAIR 1994, 3rd Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175 (1994)
Christodorescu, M., Jha, S.: Static analysis of executables to detect malicious patterns. In: Proceedings of the 12th USENIX Security Symposium, Washington, DC, August 2003, pp. 169–186 (2003)
Cohen, P., Heeringa, B., Adams, N.M.: An unsupervised algorithm for segmenting categorical timeseries into episodes. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds.) Pattern Detection and Discovery. LNCS (LNAI), vol. 2447, pp. 49–62. Springer, Heidelberg (2002)
Dash, S.K., Reddy, K.S., Pujari, A.K.: Episode Based Masquerade Detection. In: Jajodia, S., Mazumdar, C. (eds.) ICISS 2005. LNCS, vol. 3803, pp. 251–262. Springer, Heidelberg (2005)
Debar, H., Dacier, M., Nassehi, M., Wespi, A.: Fixed vs. variable-length patterns for detecting suspicious process behavior. Journal of Computer Security 8(2/3) (2000)
Firoiu, L.: Segmenting Time Series with a Hybrid Neural Networks – Hidden Markov Model (2002), http://www.citeseer.ist.psu.edu/firoiu02segmenting.html
Fürnkranz, J.: A study using n-gram features for text categorization. Technical Report OEFAI-TR-9830, Austrian Research Institute for Artificial Intelligence (1998)
Gartner Inc. (2005), http://www.gartner.com/press_releases/asset_129199_11.html
Gionis, A., Mannila, H.: Segmentation Algorithms for Time Series and Sequence Data. In: SIAM International Conference on Data Mining, Newport Beach, CA (2005)
Jiang, G., Chen, H., Ungureanu, C., Yoshihira, K.: Multi-resolution abnormal trace detection using varied-length N-grams and automata. In: Proceedings of the Second International Conference on Autonomic Computing (2005)
Kephart, J.O., Sorkin, G.B., Arnold, W.C., Chess, D.M., Tesauro, G.J., White, S.R.: Biologically inspired defenses against computer viruses. In: Proceedings of IJCAI 1995, Montreal, August 1995, pp. 985–996 (1995)
Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)
Lo, R.W., Levitt, K.N., Olsson, R.A.: MCF: A malicious code filter. Computers & Security 14(6), 541–566 (1995)
Marceau, C.: Characterizing the behavior of a program using multiple-length N-grams. In: Proceedings of the 2000 Workshop on New security paradigms (2000)
McGraw, G., Morrisett, G.: Attacking Malicious Code: A Report to the Infosec Research Council. IEEE Software (September/October 2000)
Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
Nachenberg, C.: Understanding and managing polymorphic viruses. The Symantec Enterprise Papers, vol. XXX
Reddy, D.K.S., Pujari, A.K.: N-gram Analysis for New Computer Virus Detection. Communicated to the Journal in Computer Virology
Schultz, M.G., Eskin, E., Zadok, E., Stolfo, S.J.: Data mining methods for detection of new malicious executables. In: Proceedings of IEEE Symposium on Security and Privacy (2001)
Schultz, M.G., Eskin, E., Zadok, E., Bhattacharyya, M., Stolfo, S.J.: MEF: Malicious Email Filter, A UNIX mail filter that detects malicious Windows executables. In: Proceedings of USENIX Annual Technical Conference (2001)
Szor, P.: The Art of Computer Virus Research and Defense. Addison Wesley, Reading (2005)
VX Heavens, http://vx.netlux.org
Witten, I., Frank, E.: Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco (2000)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of 14th International Conference on Machine Learning, pp. 412–420 (1997)
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Reddy, D.K.S., Dash, S.K., Pujari, A.K. (2006). New Malicious Code Detection Using Variable Length n-grams. In: Bagchi, A., Atluri, V. (eds) Information Systems Security. ICISS 2006. Lecture Notes in Computer Science, vol 4332. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11961635_19
DOI: https://doi.org/10.1007/11961635_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68962-1
Online ISBN: 978-3-540-68963-8
eBook Packages: Computer Science, Computer Science (R0)