N-gram analysis for computer virus detection

Reddy, D Krishna Sandeep; Pujari, Arun K

doi:10.1007/s11416-006-0027-8

Original Paper
Published: 08 November 2006

Journal in Computer Virology, Volume 2, pages 231–239 (2006)

D Krishna Sandeep Reddy¹ &
Arun K Pujari¹

Abstract

Generic computer virus detection is urgently needed, as most commercial antivirus software fails to detect new and unknown viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables has been directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are taken as the basis of feature extraction. Because the number of n-grams is very large, several feature selection methods have been proposed in the literature. A recent report on information-gain-based feature selection yielded the best-known result in distinguishing malicious executables from benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other. Through a simple example, we show that this may lead to erroneous results. In this paper, we describe a new feature selection measure, the class-wise document frequency of byte n-grams. We demonstrate empirically that the proposed measure is a better method for feature selection. For detection, instead of relying on any single classifier, we combine several classifiers using Dempster–Shafer theory for better classification accuracy. Our experimental results show that such a scheme detects virus programs far more efficiently than previously known methods.
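
The abstract does not spell out how class-wise document frequency is computed, so the following Python sketch shows only one plausible reading: for each class, count the number of files that contain a given byte n-gram, then keep the most frequent n-grams of each class. The function names (`byte_ngrams`, `classwise_document_frequency`, `select_features`), the toy byte strings, and the top-k union rule are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def byte_ngrams(data: bytes, n: int) -> Set[bytes]:
    """Return the set of distinct byte n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classwise_document_frequency(
    files_by_class: Dict[str, Iterable[bytes]], n: int
) -> Dict[str, Counter]:
    """For each class, count how many of its files contain each n-gram."""
    df: Dict[str, Counter] = {}
    for label, files in files_by_class.items():
        counts: Counter = Counter()
        for data in files:
            # A set is used so that each file contributes at most 1 per n-gram.
            counts.update(byte_ngrams(data, n))
        df[label] = counts
    return df


def select_features(df: Dict[str, Counter], k: int) -> List[bytes]:
    """Union of the k most frequent n-grams of each class (assumed rule)."""
    selected: Set[bytes] = set()
    for counts in df.values():
        # Sort by descending frequency, then by n-gram, so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(ngram for ngram, _ in ranked[:k])
    return sorted(selected)


# Hypothetical toy corpus: two "virus" files and two "benign" files.
corpus = {
    "virus": [b"\x90\x90\xeb\xfe", b"\x90\x90\xcc"],
    "benign": [b"\x55\x8b\xec", b"\x55\x8b\xec\x5d"],
}
df = classwise_document_frequency(corpus, n=2)
features = select_features(df, k=1)
```

Under this reading, an n-gram is ranked only by how often it appears within a class, without requiring it to be absent from the other class, which is the property of information gain that the abstract questions.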

Author information

Authors and Affiliations

Artificial Intelligence Lab, University of Hyderabad, Hyderabad, 500 046, India
D Krishna Sandeep Reddy & Arun K Pujari

Corresponding author

Correspondence to Arun K Pujari.

About this article

Cite this article

Reddy, D.K.S., Pujari, A.K. N-gram analysis for computer virus detection. J Comput Virol 2, 231–239 (2006). https://doi.org/10.1007/s11416-006-0027-8

Received: 10 August 2006
Revised: 19 September 2006
Accepted: 01 October 2006
Published: 08 November 2006
Issue Date: December 2006
DOI: https://doi.org/10.1007/s11416-006-0027-8
