Efficient feature extraction methodologies for unknown MP4-Malware detection using Machine learning algorithms

https://doi.org/10.1016/j.eswa.2023.119615

Abstract

We are living in an era in which daily interaction between individuals and businesses involves sending, uploading, and sharing videos as a means of communication and advertising. However, many users are unaware of the risks associated with opening a malicious video file; it is thus no surprise that cyber-criminals have taken advantage of this situation and adopted this attack vector in recent years. MP4 is one of the most commonly used video formats, and its properties make it well-suited for software vulnerability exploitation across multiple platforms, which can ultimately lead to a cyberattack. Because they rely on a deterministic, signature-based technique, antivirus software solutions are limited in their ability to detect unknown malware, let alone zero-day attacks. Machine learning (ML) algorithms have been effective in detecting known and unknown malware across various file formats, domains, and platforms. The performance of ML algorithms relies heavily on the feature extraction methodology. However, to the best of our knowledge, there is no dedicated feature extraction methodology for MP4 files that generates a set of features for the task of unknown MP4 file malware detection. In this paper, we present three innovative and efficient feature extraction methodologies for unknown MP4 file malware detection. Two of them are file structure-based and one is knowledge-based. The methodologies are evaluated in a series of five experiments using six ML algorithms and 177 different datasets, which represent different configurations of feature extraction, representation, and selection. The datasets are based on a representative collection of 6,229 files: 5,066 benign files (∼81.3 %) and 1,163 malicious files (∼18.7 %). The first three experiments demonstrate the methodologies’ discrimination and generalization capabilities across multiple configurations, in terms of known and unknown MP4 file malware detection. The fourth experiment shows that applying principal component analysis (PCA) to the features suggested by the methodologies can improve time and space complexity and feature resilience while maintaining strong detection and generalization capabilities. In the fifth experiment, the methodologies’ best performing configuration is compared to state-of-the-art, generic feature extraction methodologies, such as n-grams, MinHash, and representation and transfer learning (using a CNN), in the task of unknown MP4 file malware detection. The results show that our best performing configuration outperforms all other state-of-the-art feature extraction methodologies, with an AUC, TPR, and FPR of 0.9951, 0.976, and 0.0, respectively.

Introduction

In mid-November 2019, Facebook disclosed a severe vulnerability in the WhatsApp messenger app, CVE-2019-11931. The vulnerability allowed attackers to remotely execute arbitrary code or cause a denial-of-service (DoS) attack using a specially crafted MP4 file. The exploit stemmed from the way the app parsed elementary metadata streams. Facebook did not reveal many technical details, although examples can be found online. Facebook also did not reveal whether this was a zero-click attack, i.e., one in which the device is infected without any user action. In this case, over 1.5 billion active users were at immediate risk. The above scenario is just one example of how a media file can be used as an attack vector. In July 2019, Symantec discovered another attack vector, called Media File Jacking, which allowed attackers to modify videos and images and change their content without the users’ knowledge on both WhatsApp and Telegram; such an attack enables the attacker to bypass end-to-end encryption and potentially transform a user-trusted video into a malicious one. The latter example demonstrates how attackers have exploited popular communication and entertainment trends associated with today’s combination of attractive content and high-speed connections, in which users download and share MP4 files without considering the potential malicious outcomes; attackers use social engineering to lure and manipulate victims and cause them to execute and propagate malicious files. For instance, in 2014 the Trojan.FakeFlash.A malware started appearing in Facebook ads promoting naked videos of other Facebook friends. There were no naked videos, but the ad led users to a fake YouTube page that encouraged them to install a fake Adobe Flash Player update in order to see the videos. This malware infected two million users’ devices directly and convinced many more to at least click on the ad.

Videos are non-executable file formats, which users generally consider safer than executable formats. This misconception stems from the fact that non-executable files can only be decoded using dedicated programs, i.e., malicious video files can only exploit vulnerabilities of programs that are devoted to decoding their format. However, once a vulnerability is exploited, non-executable files are as dangerous as executable files, allowing an attacker to perform any malicious actions needed. Moreover, media players are frequently used software. Security Intelligence reported that over 1,200 vulnerabilities were discovered in the National Vulnerability Database (NVD) from 2000 through 2014. At the time of this writing, we identified 155 published software vulnerabilities directly related to MP4, which were published from 2014 through late 2019. Note that 105 of those vulnerabilities have been published since 2017, indicating the increasing risk associated with this attack vector.

MP4 is among the most used media file formats. Therefore, it has nearly unlimited potential for being used to execute a widespread attack. There are no directly available statistics regarding the popularity of the MP4 file format; nevertheless, some correlated statistics give an indication. For example, Encoding.com (one of the world’s largest cloud-based media processing providers) reported that the most used codec for over-the-top (OTT) content is H.264, which generally refers to MP4 files. Another example is that the recommended format for uploading a video to YouTube is MP4; an estimated 207 TB and 432,000 hours of new content are uploaded to YouTube every day. Moreover, MP4 is a commonly used format for uploading videos in most social networks.

Today, common antivirus software uses signature-based malware detection, which is effective at identifying known malware. Unfortunately, new variants of malicious samples appear in the wild every day. Hence, new malware, or even new variants of known malware, could be classified as benign by signature-based methodologies.

Taking the above into account, we highlight three main factors that motivated our study: (1) MP4 is among the most popular media formats; (2) MP4 is a non-executable format, which users often prefer to use since it cannot be decoded without a dedicated application; and (3) there are no machine learning-based methods or tools that address the challenge of unknown MP4 malware detection, and, as far as we could find, no publicly available studies (comprehensive or partial) have been conducted in this domain.

Given the above, it is surprising that, until now, no comprehensive (or partial) studies have been conducted in this domain. In this study, we present three novel methodologies for the efficient extraction of features from MP4 files, which can be effectively used to train machine learning (ML) algorithms for the purpose of unknown MP4 file malware detection. This paper’s contributions are:

  • Proposing three novel feature extraction methodologies for known and unknown MP4 file malware detection.
  • Implementing our proposed feature extraction methodologies using different ML algorithms for detection of known and unknown MP4 file malware.
  • Determining the best configurations of feature extraction, feature representation, feature selection, top N selected features, and ML algorithms for unknown MP4 file malware detection.
  • Improving our proposed methodologies with PCA, in terms of time and space complexity. PCA over the features creates encapsulation and resilience, yet still maintains a high level of generalization capabilities.
  • Comparing the best performing configurations with state-of-the-art feature extraction methodologies on the task of unknown MP4 file malware detection based on performance and processing efficiency.
  • Detecting unknown MP4 file malware across multiple platforms: Windows, Mac, Linux, Android, etc.
  • Creating a representative up-to-date collection of malicious and benign MP4 files for future research.

The rest of the paper is organized as follows. Section 2 presents an overview of the MP4 file structure, common vulnerabilities, and a possible attack scenario. Section 3 surveys related work on ML-based unknown malware detection in non-executable files. Section 4 describes the methods and tools we applied to build a framework aimed at improving unknown MP4 file malware detection capabilities. Section 5 discusses the evaluation metrics used and the experiments conducted to assess the proposed methodologies, and presents the results obtained. In Section 6, we discuss our findings in relation to the methodologies’ performance. We conclude in Sections 7-9 by discussing the methodologies’ limitations and future work.

Background

MPEG-4 is an ISO/IEC standard developed by the Moving Picture Experts Group (MPEG). The ISO base media file format was based on Apple’s QuickTime container format. MP4 is an instance of the ISO base media file format (ISO/IEC 14496-12, ISO/IEC 15444-12). MP4 can be referred to as the official file extension or file format designed to contain the media information of an MPEG-4 presentation in a flexible,
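
To make the container structure concrete, the following is a minimal sketch of how the top-level boxes (atoms) of an ISO base media file can be enumerated. It relies only on the generic box header layout (a 32-bit big-endian size followed by a four-character type, with a size of 1 signalling a 64-bit size and a size of 0 meaning "until end of file"). The function names `iter_top_level_boxes` and `make_box` are illustrative assumptions and are not taken from the paper.

```python
import io
import struct


def iter_top_level_boxes(stream):
    """Yield (box_type, offset, size) for each top-level box in the stream."""
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    offset = 0
    while offset + 8 <= end:
        stream.seek(offset)
        size, box_type = struct.unpack(">I4s", stream.read(8))
        header = 8
        if size == 1:
            # A 32-bit size of 1 means a 64-bit size follows the type field.
            raw = stream.read(8)
            if len(raw) < 8:
                break
            (size,) = struct.unpack(">Q", raw)
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the file.
            size = end - offset
        yield box_type.decode("latin-1"), offset, size
        if size < header or offset + size > end:
            # Malformed or truncated box: it was reported above, so stop here.
            break
        offset += size


def make_box(box_type, payload=b""):
    """Build a synthetic box for demonstration purposes (hypothetical helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


if __name__ == "__main__":
    sample = (
        make_box(b"ftyp", b"isom\x00\x00\x02\x00isom")
        + make_box(b"moov", make_box(b"mvhd", b"\x00" * 4))
        + make_box(b"mdat", b"\x00" * 16)
    )
    for box_type, offset, size in iter_top_level_boxes(io.BytesIO(sample)):
        print(box_type, offset, size)
```

The sketch reports a malformed box before stopping rather than silently skipping it, since inconsistent sizes are exactly the kind of structural irregularity a detector may want to see.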

Related work

In this section, we provide a summary of previous studies related to our field of research. To the best of our knowledge, no comprehensive (or partial) studies have been performed in this area. We therefore briefly survey a few peripheral domains, concentrating on malware detection methods and techniques for non-executable files, as a means of better understanding how the structure and properties of an MP4 file might be leveraged to extract informative features for malware detection.

Methods

In this section, we describe the methods proposed and used in our study. Before doing so, we provide a short explanation of the MP4 file structure.

Evaluation

In this section, we describe the evaluation procedures, research questions, and experimental design.
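
The abstract reports results in terms of AUC, TPR, and FPR. As a reminder of what these metrics measure, the sketch below computes them in plain Python on a small made-up example. The labels, scores, and threshold are illustrative assumptions only and are unrelated to the paper's data; the function names are hypothetical.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels, where 1 = malicious and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def auc(y_true, scores):
    """AUC as the probability that a random malicious sample is scored
    higher than a random benign one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and classifier scores, for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print("TPR, FPR:", tpr_fpr(y_true, y_pred))
    print("AUC:", auc(y_true, scores))
```

TPR and FPR depend on the chosen decision threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the three are usually reported together.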

Results

First, we discuss the results of our exploration of features extracted by the proposed methodologies (see Fig. 9, Fig. 10, Fig. 11), to get a sense of their ability to discriminate between malicious and benign MP4 files. The features extracted for the benign samples (which appear in red in the figures) and the malicious samples (in blue) were visualized in 2D using the t-distributed stochastic neighbor embedding (t-SNE) technique (Van Der Maaten and Hinton, 2008).

Discussion and conclusions

In this paper, we presented efficient feature extraction methodologies for unknown MP4 file malware detection using machine learning algorithms. We evaluated our methodologies’ discrimination and generalization capabilities through a series of five comprehensive experiments. We used a large representative dataset of 6,229 MP4 files, which consisted of 5,066 benign samples (∼81.3 %) and 1,163 malicious samples (∼18.7 %); the malicious samples were divided into majority and minority classes,

Limitations

As discussed in section 4, our feature extraction methodologies are based on the structural frame or the metadata and meta-features of an MP4 file. Hence, the methodologies do not parse the actual content of the file, such as the raw data objects usually found in the mdat atom (e.g., video and objects). To the best of our knowledge, there has been no documentation of a malicious file or cyberattack that integrated a specially crafted MP4 file that contained malicious content inside a raw data object;
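
The distinction between structure and content can be illustrated with a small sketch that derives a few features from box headers alone, without ever reading the mdat payload. The specific features below (box counts, share of bytes outside mdat, trailing bytes) and the function name `structural_features` are hypothetical choices made for illustration; they are not the feature set proposed in the paper.

```python
from collections import Counter


def structural_features(boxes, file_size):
    """Compute illustrative structure-only features.

    boxes: iterable of (box_type, offset, size) tuples for top-level boxes.
    file_size: total size of the file in bytes.
    Only header information is used; payload bytes are never inspected.
    """
    boxes = list(boxes)
    counts = Counter(box_type for box_type, _, _ in boxes)
    covered = sum(size for _, _, size in boxes)
    mdat_bytes = sum(size for box_type, _, size in boxes if box_type == "mdat")
    return {
        "n_boxes": len(boxes),
        "n_distinct_types": len(counts),
        "max_type_repeats": max(counts.values(), default=0),
        # Share of the file occupied by boxes other than mdat.
        "non_mdat_ratio": (covered - mdat_bytes) / file_size if file_size else 0.0,
        # Bytes after the last box that belong to no box at all.
        "trailing_bytes": file_size - covered,
    }


if __name__ == "__main__":
    # Hypothetical box listing for a 70-byte file with 6 unaccounted bytes.
    example = [("ftyp", 0, 20), ("moov", 20, 20), ("mdat", 40, 24)]
    print(structural_features(example, file_size=70))
```

Because such features ignore the raw media payload, a file whose malicious content sits entirely inside a raw data object would look unremarkable to them, which is the limitation described above.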

Coping with possible attacks

Extensive research has been conducted on the vulnerability of ML algorithms to adversarial attacks. In the context of this section, we focus on defense mechanisms against black-box evasion attacks, as described by Papernot, McDaniel, & Goodfellow (2016). Defense against evasion attacks presumes that the attacker has access to or knowledge of the trained classifier and/or input datasets. In our case, we assume that an attacker has limited access to the above but can query the model and use it as

Future work

In this paper, we have evaluated our suggested framework of innovative and efficient feature extraction methodologies for unknown MP4 file malware detection. As described in section 2, an MP4 file is an instance of the ISO base media file format. Other instances are 3GPP, 3GPP2, JPEG-2000, F4V, MPEG-21, etc. Therefore, we aim to further extend our framework to develop other unique methodologies that can extract informative and discriminative features for unknown malware

CRediT authorship contribution statement

Tal Tsafrir: Data curation, Formal analysis, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Aviad Cohen: Data curation, Software, Writing – review & editing. Etay Nir: Data curation, Software, Writing – review & editing. Nir Nissim: Conceptualization, Funding acquisition, Investigation, Supervision, Data curation, Formal analysis, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing –

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (65)

  • Broder, Andrei Z. 1997. “On the Resemblance and Containment of Documents.” Pp. 21–29 in Proceedings of the...
  • Chang, Chih-Chung, and Chih-Jen Lin. 2001. LIBSVM: A Library for Support Vector...
  • Chio, C., and David F. 2017. “Machine Learning and Security.”...
  • Cohen, A., Nir N., and Yuval E. 2020. “MalJPEG: Machine Learning Based Solution for the Detection of Malicious JPEG...
  • Cui, Z., et al. 2018. “Detection of Malicious Code Variants Based on Deep Learning.” IEEE Transactions on Industrial Informatics.
  • Ford, S., et al. “Analyzing and Detecting Malicious Flash Advertisements.”
  • Fridrich, J., et al. 2003. “Steganalysis of JPEG Images: Breaking the F5 Algorithm.”
  • Fu, D., and Feiyue S. 2012. “Buffer Overflow Exploit and Defensive Techniques.” Proceedings - 2012 4th International...
  • Fuyong, Z., and Zhao, T. 2017. “Malware Detection and Classification Based on N-Grams Attribute Similarity.” Pp. 793–96...
  • Gu, Q., Zhenhui L., Jiawei, H. n.d. Generalized Fisher Score for Feature...
  • Hao Lee, W., Murali S. R., and Krishnan, S. P. T. n.d. On Designing an Efficient Distributed Black-Box Fuzzing System...
  • He, K., Xiangyu, Z., Shaoqing, R., and Jian, S. 2016. “Deep Residual Learning for Image Recognition.” Pp. 770–78 in...
  • He, L., Yan, C., Hong, H., Purui, S., Zhenkai, L., Yi, Y. 2017. “Automatically Assessing Crashes from Heap Overflows.”...
  • Hiester, L. 2018. “File Fragment Classification Using Neural Networks with Lossless Representations Networks with...
  • Jeong, Y.-S., et al. 2019. “Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks.” Security and Communication Networks.
  • Jia, X., Chao, Z., Purui, S., Yi, Y., Huafeng, H., and Dengguo, F. 2017. “Towards Efficient Heap Overflow Discovery.”...
  • Jordaney, R., Royal, H., Davide, P., Elettronica, SpA., Ilia, N., Lorenzo, C., Kumar, S., Santanu Kumar, D., Zhi, W....
  • Jung, W., Sangwon, K., Sangyong, C. 2015. Deep Learning for Zero-Day Flash Malware...
  • Kalash, M., Mrigank, R., Noman, M., Neil, D. B. Bruce, Yang, W., and Farkhund, Iqbal. 2018. “Malware Classification...
  • Khurana, M., Ruby, Y., Meena, K. n.d. “Buffer Overflow and SQL Injection: To Remotely Attack and Access...
  • Kolter, J. Z., and Marcus, A. M. 2004. “Learning to Detect Malicious Executables in the Wild.” Pp. 470–78 in KDD-2004 -...
  • Kunwar, R. S., and Priyanka, S. 2018. “Framework to Detect Malicious Codes Embedded with JPEG Images over Social...

Peer review under responsibility of Submissions with the production note ‘Please add the Reproducibility Badge for this item’ the Badge and the following footnote to be added:The code (and data) in this article has been certified as Reproducible by the CodeOcean: https://codeocean.com. More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physicalsciencesandengineering/computerscience/journals.