Malware classification and composition analysis: A survey of recent developments

doi:10.1016/j.jisa.2021.102828

Journal of Information Security and Applications

Volume 59, June 2021, 102828

https://doi.org/10.1016/j.jisa.2021.102828

Abstract

Malware detection and classification are becoming increasingly challenging, given the complexity of malware design and the recent advancement of communication and computing infrastructure. Existing malware classification approaches enable reverse engineers to better understand malware patterns and categorizations, and to cope with malware evolution. Moreover, new composition analysis methods have been proposed to analyze malware samples with the goal of gaining deeper insight into their functionalities and behaviors. This, in turn, helps reverse engineers discern the intent of a malware sample and understand the attackers’ objectives. This survey classifies and compares the main findings in malware classification and composition analysis. We also discuss malware evasion techniques and feature extraction methods. In addition, we characterize each reviewed paper on the basis of both the algorithms and the features used, and highlight its strengths and limitations. Finally, we present issues, challenges, and future research directions related to malware analysis.

Introduction

In recent years, many cyber-security mechanisms have been designed and developed to defend against evolving security threats. Nevertheless, recent statistics [1] indicate that malware is still evolving and becoming more sophisticated than ever. As a result, malware is becoming harder to detect, and its inner workings harder to understand. This mainly stems from two reasons. The first is that attackers have become more proficient in launching attacks and hiding malicious behavior using anti-analysis techniques such as obfuscation and packing. The second is that the current communication and computing infrastructure is becoming increasingly dynamic and heterogeneous, which enables a single piece of malware to take various forms that are semantically but not structurally similar. This, in turn, makes malware analysis even more challenging.

Malware (or malicious software) is software that is designed to harm users, organizations, and telecommunication and computer systems. More specifically, malware can block internet connections, corrupt an operating system, steal a user’s password and other private information, and/or encrypt important documents on a computer and demand ransom. In recent years, malware has been a growing threat to computer users; in 2017, the number of new malware increased by 22.9% over 2016 to reach 8,400,058 [2], [3], [4], [5]. Moreover, malware has become the primary medium for launching large-scale attacks, such as compromising computers, bringing down hosts and servers, sending out spam emails, crippling critical infrastructures, and penetrating data centers [6], [7], [8]. These attacks lead to severe damage and significant financial loss [9], [10], [11].

Most antivirus engines detect and classify malware by continuously scanning files and comparing their signatures with known malware signatures. The malware signatures are typically created by human antivirus experts (known as malware defenders) who examine the collected malware samples. These signatures can be filenames, text strings, or regular expressions of byte code [12], [13]. Signature-based methods can therefore only detect traditional malware that does not change significantly. However, malware can hide its malicious behavior using anti-analysis techniques such as obfuscation, packing, polymorphism, and metamorphism, in such a way that the code looks quite different from its original version. Thus, the primary shortcoming of signature-based methods is that they achieve high precision but low recall: a matched signature is rarely a false alarm, but modified variants go undetected. Also, the process of creating malware signatures is labor-intensive. Considering the large number of new malware that appear every day, there is a pressing need to develop new intelligent malware analysis methods to tackle these challenges.
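The precision/recall trade-off of signature matching can be illustrated with a minimal sketch. The signature names, byte patterns, and sample data below are hypothetical and chosen only for illustration; they are not taken from any real antivirus engine or from the surveyed works.

```python
import re

# Hypothetical signatures: one text-string pattern and one byte-code regular expression.
SIGNATURES = {
    "demo_string_sig": re.compile(rb"EVIL_PAYLOAD_V1"),
    "demo_bytes_sig": re.compile(rb"\x90{4}\xeb\xfe"),
}


def scan(data: bytes) -> list[str]:
    """Return the names of all signatures that match the given bytes."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(data)]


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compute precision and recall from boolean predictions and ground-truth labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labelled samples (assumed data): (file content, is_malware).
samples = [
    (b"header EVIL_PAYLOAD_V1 trailer", True),   # known malware, unchanged
    (b"header EVIL_PAYL0AD_V1 trailer", True),   # trivially modified variant
    (b"\x90\x90\x90\x90\xeb\xfe", True),         # matches the byte-code signature
    (b"plain benign document", False),           # benign file
]

predicted = [bool(scan(data)) for data, _ in samples]
actual = [label for _, label in samples]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting every match is a true positive, so precision is perfect, while the single-character variant slips past both signatures and lowers recall. This mirrors the shortcoming described above, although real engines and real evasion techniques are far more elaborate.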

To alleviate the burden of manual signature crafting, researchers have proposed automatic signature generation methods [14], [15]. The content of the signatures can be Windows system call combinations [16], control flow graphs [15], and functions [14].

Researchers have also proposed using machine learning models to detect and classify malware [12], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. Unlike other machine learning-driven classification tasks, such as image classification, malware classification involves a competition between malware creators and defenders. When malware defenders propose a new malware analysis system using some features and machine learning models, malware creators often update their malware design to avoid being detected. Malware defenders then propose new systems to detect and analyze the new generation of malware, and so forth. The race between malware defenders and attackers may never come to an end.

Recently, many researchers have started to use deep learning models to enhance the accuracy of malware detection and classification [24], [25], [26], [27]. Although promising results have been achieved through the ability to extract robust and useful features using state-of-the-art deep learning architectures, the proposed models were shown to be highly vulnerable to adversarial examples, which attackers can easily design (simply by perturbing parts of the inputs) to fool Artificial Intelligence (AI)-driven malware analysis systems and make them generate erroneous decisions [24], [25], [26], [27], [28], [29]. As a result, several methods have been proposed to defend against adversarial examples [28], [29].
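To make the idea of perturbing parts of the input concrete, the following is a minimal sketch of an evasion attempt against a toy linear detector over binary features. The feature names, weights, bias, and the greedy strategy are all assumptions made for illustration; they do not reproduce the attacks or models of the cited works, which target deep neural networks.

```python
# Hypothetical weights of a toy linear detector over binary features.
# Positive weights push the decision towards "malicious", negative ones towards "benign".
WEIGHTS = {
    "writes_registry": 1.5,
    "opens_socket": 1.0,
    "encrypts_files": 2.0,
    "shows_gui": -1.2,
    "signed_binary": -2.5,
    "reads_config": -0.8,
}
BIAS = -1.0


def score(features: set[str]) -> float:
    """Linear score: bias plus the weights of all features present in the sample."""
    return BIAS + sum(WEIGHTS[f] for f in features)


def is_malicious(features: set[str]) -> bool:
    """The toy detector flags a sample when its score is positive."""
    return score(features) > 0


def perturb(features: set[str], max_changes: int = 3) -> set[str]:
    """Greedily add benign-looking features until the detector's decision flips.

    Only additions are allowed, on the assumption that adding features
    does not remove the sample's original behavior.
    """
    adversarial = set(features)
    candidates = sorted(
        (f for f in WEIGHTS if f not in adversarial and WEIGHTS[f] < 0),
        key=WEIGHTS.get,  # most negative weight first
    )
    for feature in candidates[:max_changes]:
        if not is_malicious(adversarial):
            break
        adversarial.add(feature)
    return adversarial


original = {"writes_registry", "opens_socket", "encrypts_files"}
adversarial = perturb(original)
print(is_malicious(original), is_malicious(adversarial), sorted(adversarial - original))
```

The sketch keeps all original features and only adds new ones, yet the detector's decision changes. A defense in the spirit of those cited above would have to make the model's decision less sensitive to such small, targeted changes.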

In addition to malware classification, researchers in malware analysis have developed new techniques and methods to analyze the composition of malware samples by matching their functionalities and behaviors to multiple known malware families. This, in turn, helps reverse engineers discern the intent of both a malware sample and the attacker. Moreover, these composition analysis methods enable reverse engineers and organizations to effectively triage their resources.

This literature review classifies and compares the recent and main findings in malware classification. Unlike other similar works, which focus either on AI-driven malware classification [30], [31], [32] or on non-AI-driven malware classification [33], [34], this paper includes both AI-driven and non-AI-driven recent works. We also survey methods and approaches that have recently been proposed to analyze the composition of malware samples in order to understand their functionalities and behaviors. To the best of our knowledge, this is the first work that surveys the existing composition analysis techniques. This survey also aims at identifying the main issues and challenges related to recent malware classification and composition analysis techniques. In particular, our analysis leads us to recognize three major problems to address. The first is the need to overcome modern evasion techniques (or anti-analysis techniques) such as metamorphism. The second relates to the efficiency and scalability of malware search engines, as the number of functions in the repository might need to scale up to millions. The third concerns the vulnerability of malware classification systems to evolving adversarial examples. We also uncover possible topics that need further study and investigation, such as sustainable malware analysis systems. In this regard, we propose a few guidelines for preparing efficient and trustworthy malware detection and analysis systems.

The main contributions of this survey are:

• Proposing a new taxonomy for describing and comparing the recent and main findings in malware classification and composition analysis.
• Designing a new framework for analyzing the existing malware classification and composition analysis techniques.
• Identifying and presenting open issues and challenges related to malware analysis.
• Identifying a number of trends on the topic, with guidelines on how to improve existing solutions to address new and continuing challenges.

The rest of this paper is organized as follows. In Section 2, we discuss the related survey papers. In Section 3 and Section 4, we present the proposed taxonomy for organizing reviewed malware classification and composition analysis approaches, respectively. Section 5 characterizes reviewed papers according to the proposed taxonomy. The challenges and current issues are pointed out in Section 6. Section 7 suggests possible research topics in malware analysis. Finally, Section 8 concludes the paper.

Related surveys

Other works have already surveyed contributions in malware classification. For example, Bazrafshan et al. [33] classify malware detection methods into three types: signature-based, behavior-based, and heuristic-based methods. They also recognize five classes of features used by the heuristic-based methods: opcodes, API calls, control flow graphs, n-grams, and hybrid features. Another work, presented by Shabtai et al. [34], studies how to detect malware using static

Taxonomy of malware classification

We present in this section the taxonomy of malware classification. We define two categories (or dimensions) to organize the existing works. The first category presents the features that our work is based on. In particular, we discuss the different methodologies used for extracting features, e.g., dynamic and/or static techniques, and what types of features are used, e.g., assembly code. The second is concerned with the type of algorithm that is adopted for the detection and analysis,

Taxonomy of composition analysis techniques

This section introduces the taxonomy of malware composition analysis techniques. We identify two major dimensions along which surveyed papers can be conveniently organized. The first one shows the steps used for composition analysis. The second dimension identifies the objective (i.e., strategy) of the analysis. Fig. 3 shows a graphical representation of the proposed taxonomy.

Characterization of surveyed papers

In this section, we characterize each reviewed paper. Table 1 provides information about both algorithms and features used for each paper and highlights the main limitations. The table also shows the scalability of each work in terms of its ability to work in the presence of incremental update of the repository. The last column shows whether the proposed classification techniques are robust against anti-analysis techniques or not. As can be seen in Table 1, most of the works use more than one

Challenges and issues

Based on the characterization explained in Section 5, we discuss here the challenges and/or issues of the surveyed articles.

Research direction

The above contributions are effective in addressing some interesting research gaps in the literature. However, some points still need further study and investigation. The following research avenues could be further explored based on our literature review:

Conclusion

In this paper, we provide a comprehensive survey of publications that contributed to malware classification and composition analysis. There are four main contributions in our work. First, we proposed an organization of the reviewed papers according to three dimensions: the purpose of the analysis (malware classification or composition analysis), the type of features obtained from samples, and the algorithms used to manipulate these features. Second, we provided a comparative analysis of the existing

CRediT authorship contribution statement

Adel Abusitta: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Supervision, Visualization, Investigation. Miles Q. Li: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Visualization, Investigation. Benjamin C.M. Fung: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Writing - reviewing and editing, Project

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported in part by the DND Innovation for Defence Excellence and Security, Canada (W7714-207117/001/SV), NSERC, Canada Discovery Grants (RGPIN-2018-03872), and Canada Research Chairs Program (950-230623). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References (155)

Kimani K. et al. Cyber security challenges for IoT-based smart grid networks. Int J Crit Infrastruct Prot (2019)
Islam R. et al. Classification of malware based on integrated static and dynamic features. J Netw Comput Appl (2013)
Shabtai A. et al. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Inf Secur Tech Rep (2009)
Ghiasi M. et al. Dynamic VSA: a framework for malware detection based on register contents. Eng Appl Artif Intell (2015)
Mohaisen A. et al. Amal: High-fidelity, behavior-based automated malware analysis and classification. Comput Secur (2015)
Chen Z. et al. Malware characteristics and threats on the internet ecosystem. J Syst Softw (2012)
Santos I. et al. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inform Sci (2013)
Malware statistics and facts for 2020. 2020. https://www.comparitech.com/antivirus/malware-statistics-facts/. [Accessed...
Malware Numbers 2017. 2019. https://www.gdatasoftware.com/blog/2018/03/30610-malware-number-2017. [Accessed 17 August...
Suarez-Tangil G. et al. Evolution, detection and analysis of malware for smart devices. IEEE Commun Surv Tutor (2013)
Tailor J.P. et al. A comprehensive survey: ransomware attacks prevention, monitoring and damage control. Int J Res Sci Innov (2017)
Vignau B. et al. 10 years of IoT malware: A feature-based taxonomy
Xu Z, Wang H, Xu Z, Wang X. Power attack: An increasing threat to data centers. In: NDSS....
Jakobsson M. et al. Crimeware: understanding new attacks and defenses (2008)
Wong W. et al. Hunting for metamorphic engines. J Comput Virol (2006)
Tariq N. Impact of cyberattacks on financial institutions. J Internet Bank Commer (2018)
Chen L. et al. Adversarial machine learning in malware detection: Arms race between evasion attack and defense
Schultz M.G. et al. Data mining methods for detection of new malicious executables
Christodorescu M. et al. Static analysis of executables to detect malicious patterns. Technical report (2006)
Chen J. et al. Detecting android malware using clone detection. J Comput Sci Tech (2015)
Cesare S. et al. Classification of malware using structured control flow
Ye Y. et al. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. J Intell Inf Syst (2010)
Kolter J.Z. et al. Learning to detect malicious executables in the wild
Moskovitch R. et al. Unknown malcode detection using opcode representation
Dai J. et al. Efficient virus detection using dynamic instruction sequences. J Comput Phys (2009)
Nataraj L. et al. Malware images: visualization and automatic classification
Anderson B. et al. Graph-based malware detection using dynamic analysis. J Comput Virol (2011)
Santos I. et al. Opem: A static-dynamic approach for machine-learning-based malware detection
Dahl G.E. et al. Large-scale malware classification using random projections and neural networks
Saxe J. et al. Deep neural network based malware detection using two dimensional binary program features
Huang W. et al. MtNet: a multi-task neural network for dynamic malware classification
Kolosnjaji B. et al. Deep learning for classification of malware system call sequences
Grosse K. et al. Adversarial examples for malware detection
Wang Q. et al. Adversary resistant deep neural networks with an application to malware detection
Ucci D. et al. Survey of machine learning techniques for malware analysis. Comput Secur (2018)
Sahu M.K. et al. A review of malware detection based on pattern matching technique. Int J Comput Sci Inf Technol (2014)
Souri A. et al. A state-of-the-art survey of malware detection approaches using data mining techniques. Human-centric Comput Inf Sci (2018)
Bazrafshan Z. et al. A survey on heuristic malware detection techniques
Basu I. et al. Malware detection based on source data using data mining: A survey. Am J Adv Comput (2016)
Ye Y. et al. A survey on malware detection using data mining techniques. ACM Comput Surv (2017)
Or-Meir O. et al. Dynamic malware analysis in the modern era—A state of the art survey. ACM Comput Surv (2019)
Barriga J. et al. Malware detection and evasion with machine learning techniques: A survey. Int J Appl Eng Res (2017)
Damodaran A. et al. A comparison of static, dynamic, and hybrid analysis for malware detection. J Comput Virol Hacking Tech (2017)
Bayer U. et al. Dynamic analysis of malicious code. J Comput Virol (2006)
Anderson B. et al. Improving malware classification: bridging the static/dynamic gap
Royal P. et al. Polyunpack: Automating the hidden-code extraction of unpack-executing malware
Fredrikson M. et al. Synthesizing near-optimal malware specifications from suspicious behaviors
Force UA. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In: Proceedings of the...
Rutkowska J. Redpill: Detect VMM using (almost) one CPU instruction (2004)
Liang G. et al. A behavior-based malware variant classification technique. Int J Inf Educ Technol (2016)
