Malware classification and composition analysis: A survey of recent developments

https://doi.org/10.1016/j.jisa.2021.102828

Abstract

Malware detection and classification are becoming increasingly challenging, given the complexity of malware design and recent advances in communication and computing infrastructure. Existing malware classification approaches enable reverse engineers to better understand malware patterns and categorizations, and to cope with their evolution. Moreover, new composition analysis methods have been proposed to analyze malware samples with the goal of gaining deeper insight into their functionalities and behaviors. This, in turn, helps reverse engineers discern the intent of a malware sample and understand the attackers’ objectives. This survey classifies and compares the main findings in malware classification and composition analysis. We also discuss malware evasion techniques and feature extraction methods. In addition, we characterize each reviewed paper on the basis of both the algorithms and the features used, and highlight its strengths and limitations. Finally, we present issues, challenges, and future research directions related to malware analysis.

Introduction

In recent years, many cyber-security mechanisms have been designed and developed to defend against evolving security threats. Nevertheless, recent statistics [1] indicate that malware is still evolving and becoming more sophisticated than ever. As a result, malware is becoming harder to detect, and its inner workings harder to understand. This stems mainly from two reasons. The first is that attackers have become more proficient at launching attacks and hiding their malicious behavior using anti-analysis techniques such as obfuscation and packing. The second is that the current communication and computing infrastructure is becoming more dynamic and heterogeneous, which enables a single malware to take various forms that are semantically but not structurally similar. This, in turn, makes malware analysis even more challenging.

Malware (or malicious software) is software designed to harm users, organizations, and telecommunication and computer systems. More specifically, malware can block internet connections, corrupt an operating system, steal a user’s passwords and other private information, and/or encrypt important documents on a computer and demand a ransom. In recent years, malware has been a growing threat to computer users: in 2017, the number of new malware increased by 22.9% over 2016 to reach 8,400,058 [2], [3], [4], [5]. Moreover, malware has become the primary medium for launching large-scale attacks, such as compromising computers, bringing down hosts and servers, sending out spam emails, crippling critical infrastructures and penetrating data centers [6], [7], [8]. These attacks lead to severe damage and significant financial loss [9], [10], [11].

Most antivirus engines detect and classify malware by continuously scanning files and comparing their signatures with known malware signatures. The malware signatures are typically created by human antivirus experts (known as malware defenders) who examine the collected malware samples. These signatures can be filenames, text strings, or regular expressions of byte code [12], [13]. Consequently, signature-based methods can only detect traditional malware that does not change significantly. However, malware can hide its malicious behavior using anti-analysis techniques such as obfuscation, packing, polymorphism and metamorphism, so that the code looks quite different from its original version. Thus, the primary shortcoming of signature-based methods is that they achieve high precision but low recall. In addition, the process of creating malware signatures is labor-intensive. Given the large number of new malware that appear every day, there is a pressing need to develop new intelligent malware analysis methods to tackle these challenges.
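The signature-matching idea described above can be illustrated with a minimal sketch. It assumes signatures are regular expressions over raw bytes; the `Signature` class, the `scan` function, and both byte patterns are hypothetical names and values invented for illustration and are not taken from any antivirus product or from the cited works.

```python
import re
from dataclasses import dataclass
from typing import List, Pattern


@dataclass(frozen=True)
class Signature:
    """A hypothetical signature: a name plus a regular expression over raw bytes."""
    name: str
    pattern: Pattern[bytes]


# Illustrative signatures only; these byte patterns are made up for the sketch.
SIGNATURES = [
    Signature("demo.string", re.compile(rb"EVIL_PAYLOAD_V1")),
    Signature("demo.bytes", re.compile(rb"\xde\xad\xbe\xef.{2}\x90\x90", re.DOTALL)),
]


def scan(data: bytes) -> List[str]:
    """Return the names of all signatures that match anywhere in the data."""
    return [sig.name for sig in SIGNATURES if sig.pattern.search(data)]


original = b"header EVIL_PAYLOAD_V1 trailer"
# A trivially "obfuscated" variant: the same bytes XOR-ed with a one-byte key.
variant = bytes(b ^ 0x5A for b in original)

print(scan(original))  # the known sample is matched
print(scan(variant))   # the transformed variant is missed
```

In this toy setting the scanner flags the known sample but misses the XOR-ed variant, which mirrors the high-precision, low-recall behavior noted above.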

To alleviate the burden of manual signature crafting, researchers have proposed automatic signature generation methods [14], [15]. The content of the signatures can be combinations of Windows system calls [16], control flow graphs [15], and functions [14].

Researchers have also proposed using machine learning models to detect and classify malware [12], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. Unlike other machine learning-driven classification tasks, such as image classification, malware classification involves a competition between malware creators and defenders. When malware defenders propose a new malware analysis system based on certain features and machine learning models, malware creators often update their malware design to avoid being detected. Malware defenders then propose new systems to detect and analyze the new generation of malware, and so forth. The race between malware defenders and attackers may never come to an end.
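As a rough illustration of what "features plus a machine learning model" can look like, the sketch below extracts byte 2-gram counts and classifies with a nearest-centroid rule under cosine similarity. This is a deliberately simple stand-in chosen for brevity, not the model of any cited work; all function names and the toy training bytes are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in the input."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples: List[Tuple[bytes, str]]) -> Dict[str, Counter]:
    """Sum the n-gram counts of all samples that share a label (one centroid per label)."""
    centroids: Dict[str, Counter] = {}
    for data, label in samples:
        centroids.setdefault(label, Counter()).update(byte_ngrams(data))
    return centroids


def classify(data: bytes, centroids: Dict[str, Counter]) -> str:
    """Return the label whose centroid is most similar to the sample."""
    features = byte_ngrams(data)
    return max(centroids, key=lambda label: cosine(features, centroids[label]))


# Toy training data, invented for the sketch.
centroids = train([
    (b"\x90\x90\x90\xcc\x90\x90", "malicious"),
    (b"hello world hello", "benign"),
])
print(classify(b"\x90\x90\xcc\x90", centroids))
print(classify(b"hello there", centroids))
```

Because the decision depends entirely on which n-grams appear, an attacker who learns the feature set can reshape a sample to shift its features, which is the competition described above.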

Recently, many researchers have started to use deep learning models to enhance the accuracy of malware detection and classification [24], [25], [26], [27]. Although promising results have been achieved thanks to the ability of state-of-the-art deep learning architectures to extract robust and useful features, the proposed models were shown to be highly vulnerable to adversarial examples, which attackers can easily design (simply by perturbing parts of the inputs) to fool Artificial Intelligence (AI)-driven malware analysis systems and make them generate erroneous decisions [24], [25], [26], [27], [28], [29]. As a result, several methods have been proposed to defend against adversarial examples [28], [29].
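The effect of perturbing inputs can be shown on a toy detector. The sketch below assumes a linear model over binary features (here, the presence of imported API names) and a greedy attacker who only adds features, so the original behavior of the sample is left untouched. The weights, the threshold, and the `evade` function are all hypothetical; this is not the attack from any of the cited papers.

```python
from typing import Dict, Set

# Hypothetical weights of a linear detector over binary "API imported" features.
WEIGHTS: Dict[str, float] = {
    "CreateRemoteThread": 2.0,
    "WriteProcessMemory": 1.5,
    "GetSystemTime": -0.5,
    "MessageBoxA": -1.0,
    "LoadIconA": -1.2,
}
THRESHOLD = 1.0  # samples scoring above this value are flagged


def score(features: Set[str]) -> float:
    """Sum the weights of the features that are present."""
    return sum(WEIGHTS.get(f, 0.0) for f in features)


def is_flagged(features: Set[str]) -> bool:
    return score(features) > THRESHOLD


def evade(features: Set[str], budget: int) -> Set[str]:
    """Greedily add up to `budget` absent features, most negative weight first."""
    adversarial = set(features)
    candidates = sorted(
        (f for f in WEIGHTS if f not in adversarial and WEIGHTS[f] < 0),
        key=WEIGHTS.get,
    )
    for feature in candidates[:budget]:
        if not is_flagged(adversarial):
            break  # already below the threshold, stop perturbing
        adversarial.add(feature)
    return adversarial


sample = {"CreateRemoteThread", "WriteProcessMemory"}
adversarial = evade(sample, budget=3)
print(is_flagged(sample), is_flagged(adversarial))
```

The perturbed sample keeps every original feature yet falls below the decision threshold, which is why defenses against such inputs have become a research topic of their own.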

In addition to malware classification, researchers in malware analysis have proposed new techniques and methods to analyze the composition of malware samples by matching their functionalities and behaviors to multiple known malware families. This, in turn, helps reverse engineers discern the intent of both a malware sample and the attacker. Moreover, these composition analysis methods enable reverse engineers and organizations to triage their resources effectively.
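A minimal sketch of the matching step follows. It assumes each sample has already been split into functions, and that a function matches a family when a hash of its normalized text is found in that family's repository. Exact hashing is a simplification chosen for brevity (practical systems need matching that tolerates code changes), and the family names, function bodies, and helper names are invented for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def fingerprint(function_body: str) -> str:
    """Hash a function after trivial normalization (case and whitespace)."""
    normalized = " ".join(function_body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_repository(families: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each family name to the set of fingerprints of its known functions."""
    return {name: {fingerprint(f) for f in funcs} for name, funcs in families.items()}


def composition(sample_functions: List[str],
                repository: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of the sample's functions that match each known family."""
    prints = [fingerprint(f) for f in sample_functions]
    if not prints:
        return {name: 0.0 for name in repository}
    return {
        name: sum(p in known for p in prints) / len(prints)
        for name, known in repository.items()
    }


# Hypothetical families and function bodies, invented for the sketch.
repository = build_repository({
    "FamilyA": ["push ebp; mov ebp, esp; call encrypt_files", "xor eax, eax; ret"],
    "FamilyB": ["call open_socket; call send_data"],
})
sample = [
    "PUSH ebp;  mov ebp, esp; call encrypt_files",
    "call open_socket; call send_data",
    "mov eax, 1; ret",
    "nop; ret",
]
result = composition(sample, repository)
print(result)
```

The output is a per-family share rather than a single label, which is the kind of breakdown that lets an analyst see that one sample borrows from several families.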

This literature review classifies and compares the main recent findings in malware classification. Unlike similar works, which focus either on AI-driven malware classification [30], [31], [32] or on non-AI-driven malware classification [33], [34], this paper covers both AI-driven and non-AI-driven recent works. We also survey methods and approaches that have recently been proposed to analyze the composition of malware samples in order to understand their functionalities and behaviors. To the best of our knowledge, this is the first work that surveys the existing composition analysis techniques. This survey also aims to identify the main issues and challenges related to recent malware classification and composition analysis techniques. In particular, our analysis identifies three major problems to address. The first is the need to overcome modern evasion techniques (or anti-analysis techniques) such as metamorphism. The second relates to the efficiency and scalability of malware search engines, as the number of functions in the repository might need to scale up to millions. The third concerns the vulnerability of malware classification systems to evolving adversarial examples. We also uncover topics that need further study and investigation, such as sustainable malware analysis systems. In this regard, we propose a few guidelines for building efficient and trustworthy malware detection and analysis systems.

The main contributions of this survey are:

  • Proposing a new taxonomy for describing and comparing the recent and main findings in malware classification and composition analysis.

  • Designing a new framework for analyzing the existing malware classification and composition analysis techniques.

  • Identifying and presenting open issues and challenges related to malware analysis.

  • Identifying a number of trends on the topic, with guidelines on how to improve existing solutions to address new and continuing challenges.

The rest of this paper is organized as follows. In Section 2, we discuss the related survey papers. In Section 3 and Section 4, we present the proposed taxonomy for organizing reviewed malware classification and composition analysis approaches, respectively. Section 5 characterizes reviewed papers according to the proposed taxonomy. The challenges and current issues are pointed out in Section 6. Section 7 suggests possible research topics in malware analysis. Finally, Section 8 concludes the paper.

Section snippets

Related surveys

Other works have already surveyed contributions in malware classification. For example, Bazrafshan et al. [33] classify malware detection and classification methods into three types: signature-based, behavior-based and heuristic-based methods. They also recognize five classes of features used by heuristic-based methods: opcodes, API calls, control flow graphs, n-grams, and hybrid features. Another work presented by Shabtai et al. [34], which studies how to detect malware using static

Taxonomy of malware classification

We present in this section the taxonomy of malware classification. We define two categories (or dimensions) to organize the existing works. The first category presents the features that our work is based on. In particular, we discuss the different methodologies used for extracting features, e.g., dynamic and/or static techniques, and what types of features are used, e.g., assembly code. The second is concerned with the type of algorithm that is adopted for the detection and analysis,

Taxonomy of composition analysis techniques

This section introduces the taxonomy of malware composition analysis techniques. We identify two major dimensions along which surveyed papers can be conveniently organized. The first one shows the steps used for composition analysis. The second dimension identifies the objective (i.e., strategy) of the analysis. Fig. 3 shows a graphical representation of the proposed taxonomy.

Characterization of surveyed papers

In this section, we characterize each reviewed paper. Table 1 provides information about both algorithms and features used for each paper and highlights the main limitations. The table also shows the scalability of each work in terms of its ability to work in the presence of incremental update of the repository. The last column shows whether the proposed classification techniques are robust against anti-analysis techniques or not. As can be seen in Table 1, most of the works use more than one

Challenges and issues

Based on the characterization explained in Section 5, we discuss here the challenges and/or issues of the surveyed articles.

Research direction

The above contributions are effective in addressing some interesting research gaps in the literature. However, some points still need further study and investigation. The following research avenues could be further explored based on our literature review:

Conclusion

In this paper, we provide a comprehensive survey of publications that contributed to malware classification and composition analysis. There are four main contributions in our work. First, we proposed an organization of the reviewed papers according to three dimensions: the purpose of the analysis (malware classification or composition analysis), the type of features obtained from samples, and the algorithms used to manipulate these features. Second, we provided a comparative analysis of the existing

CRediT authorship contribution statement

Adel Abusitta: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Supervision, Visualization, Investigation. Miles Q. Li: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Visualization, Investigation. Benjamin C.M. Fung: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Writing - reviewing and editing, Project

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported in part by the DND Innovation for Defence Excellence and Security, Canada (W7714-207117/001/SV), NSERC, Canada Discovery Grants (RGPIN-2018-03872), and Canada Research Chairs Program (950-230623). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References (155)

  • Tailor J.P. et al. A comprehensive survey: ransomware attacks prevention, monitoring and damage control. Int J Res Sci Innov (2017)
  • Vignau B. et al. 10 years of IoT malware: A feature-based taxonomy
  • Xu Z, Wang H, Xu Z, Wang X. Power attack: An increasing threat to data centers. In: NDSS....
  • Jakobsson M. et al. Crimeware: understanding new attacks and defenses (2008)
  • Wong W. et al. Hunting for metamorphic engines. J Comput Virol (2006)
  • Tariq N. Impact of cyberattacks on financial institutions. J Internet Bank Commer (2018)
  • Chen L. et al. Adversarial machine learning in malware detection: Arms race between evasion attack and defense
  • Schultz M.G. et al. Data mining methods for detection of new malicious executables
  • Christodorescu M. et al. Static analysis of executables to detect malicious patterns. Technical report (2006)
  • Chen J. et al. Detecting android malware using clone detection. J Comput Sci Tech (2015)
  • Cesare S. et al. Classification of malware using structured control flow
  • Ye Y. et al. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. J Intell Inf Syst (2010)
  • Kolter J.Z. et al. Learning to detect malicious executables in the wild
  • Moskovitch R. et al. Unknown malcode detection using opcode representation
  • Dai J. et al. Efficient virus detection using dynamic instruction sequences. J Comput Phys (2009)
  • Nataraj L. et al. Malware images: visualization and automatic classification
  • Anderson B. et al. Graph-based malware detection using dynamic analysis. J Comput Virol (2011)
  • Santos I. et al. Opem: A static-dynamic approach for machine-learning-based malware detection
  • Dahl G.E. et al. Large-scale malware classification using random projections and neural networks
  • Saxe J. et al. Deep neural network based malware detection using two dimensional binary program features
  • Huang W. et al. MtNet: a multi-task neural network for dynamic malware classification
  • Kolosnjaji B. et al. Deep learning for classification of malware system call sequences
  • Grosse K. et al. Adversarial examples for malware detection
  • Wang Q. et al. Adversary resistant deep neural networks with an application to malware detection
  • Ucci D. et al. Survey of machine learning techniques for malware analysis. Comput Secur (2018)
  • Sahu M.K. et al. A review of malware detection based on pattern matching technique. Int J Comput Sci Inf Technol (2014)
  • Souri A. et al. A state-of-the-art survey of malware detection approaches using data mining techniques. Human-centric Comput Inf Sci (2018)
  • Bazrafshan Z. et al. A survey on heuristic malware detection techniques
  • Basu I. et al. Malware detection based on source data using data mining: A survey. Am J Adv Comput (2016)
  • Ye Y. et al. A survey on malware detection using data mining techniques. ACM Comput Surv (2017)
  • Or-Meir O. et al. Dynamic malware analysis in the modern era—A state of the art survey. ACM Comput Surv (2019)
  • Barriga J. et al. Malware detection and evasion with machine learning techniques: A survey. Int J Appl Eng Res (2017)
  • Damodaran A. et al. A comparison of static, dynamic, and hybrid analysis for malware detection. J Comput Virol Hacking Tech (2017)
  • Bayer U. et al. Dynamic analysis of malicious code. J Comput Virol (2006)
  • Anderson B. et al. Improving malware classification: bridging the static/dynamic gap
  • Royal P. et al. Polyunpack: Automating the hidden-code extraction of unpack-executing malware
  • Fredrikson M. et al. Synthesizing near-optimal malware specifications from suspicious behaviors
  • Force UA. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In: Proceedings of the...
  • Rutkowska J. Redpill: Detect VMM using (almost) one CPU instruction (2004)
  • Liang G. et al. A behavior-based malware variant classification technique. Int J Inf Educ Technol (2016)