N-gram analysis for computer virus detection

  • Original Paper
Journal in Computer Virology

Abstract

Generic computer virus detection is the need of the hour, as most commercial antivirus software fails to detect new and unknown viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables is directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are taken as the basis for feature extraction. Because the number of distinct n-grams is very large, several feature selection methods have been proposed in the literature. A recent report on information-gain-based feature selection yielded the best-known result in classifying malicious executables against benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other, and we show through a simple example that this may lead to erroneous results. In this paper, we describe a new feature selection measure, the class-wise document frequency of byte n-grams, and we demonstrate empirically that it is a better method for feature selection. For detection, instead of relying on any single classifier, we combine several classifiers using Dempster-Shafer theory for better classification accuracy. Our experimental results show that such a scheme detects virus programs far more efficiently than earlier known methods.
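
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the measure precisely, so the code assumes that the class-wise document frequency of an n-gram is the number of files of a given class that contain it, and that features are chosen as the union of the k most frequent n-grams of each class. The combination step uses the standard Dempster rule over the frame {virus, benign}; how each classifier's output is turned into a mass function is left open here. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct byte n-grams occurring in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classwise_document_frequency(files, labels, n):
    """Map each class label to a Counter: n-gram -> number of files
    of that class containing the n-gram (assumed definition)."""
    df = {}
    for data, label in zip(files, labels):
        # Updating with a set counts each n-gram at most once per file.
        df.setdefault(label, Counter()).update(byte_ngrams(data, n))
    return df


def select_features(df, k):
    """Union of the k most frequent n-grams of each class
    (assumed selection rule; ties are broken by byte order)."""
    selected = set()
    for counts in df.values():
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(gram for gram, _ in ranked[:k])
    return selected


def dempster_combine(m1, m2):
    """Combine two mass functions {frozenset: mass} with Dempster's rule."""
    combined = Counter()
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] += x * y
        else:
            conflict += x * y  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


# Toy data, invented purely for illustration.
files = [b"\x90\x90\xeb\xfe", b"\x90\x90\xcc", b"MZ\x00\x00", b"MZ\x00\x01"]
labels = ["virus", "virus", "benign", "benign"]
df = classwise_document_frequency(files, labels, n=2)
features = select_features(df, k=1)

VIRUS, BENIGN = frozenset({"virus"}), frozenset({"benign"})
EITHER = VIRUS | BENIGN  # mass on the whole frame expresses ignorance
m = dempster_combine({VIRUS: 0.6, EITHER: 0.4},
                     {VIRUS: 0.7, BENIGN: 0.2, EITHER: 0.1})
```

Ranking n-grams within each class separately means that an n-gram common in one class can be selected regardless of how often it appears in the other, which is the contrast with information gain that the abstract draws.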

Author information

Correspondence to Arun K Pujari.

About this article

Cite this article

Reddy, D.K.S., Pujari, A.K. N-gram analysis for computer virus detection. J Comput Virol 2, 231–239 (2006). https://doi.org/10.1007/s11416-006-0027-8

  • DOI: https://doi.org/10.1007/s11416-006-0027-8
