Static malware detection and attribution in android byte-code through an end-to-end deep system

doi:10.1016/j.future.2019.07.070

Future Generation Computer Systems

Volume 102, January 2020, Pages 112-126

Highlights

• Deep learning based Android malware detection system.
• Augmentation of a large-scale byte code dataset.
• End-to-end feature learning.
• Static detection of malware from a byte code dataset.

Abstract

Android reflects a revolution in handheld and mobile devices. It is an open-source, virtual-machine-based mobile platform that powers millions of smartphones and other devices, and an even larger number of applications in its ecosystem. In its short lifespan, Android has also seen a colossal expansion in application malware, with 99% of all smartphone malware found in the Android ecosystem. Consequently, quite a few techniques have been proposed in the literature for analyzing and detecting malicious applications on the Android platform. The growing and diversifying nature of Android malware has greatly reduced the usefulness of prevailing malware detectors, which leaves Android users susceptible to novel malware. As a remedy, in this paper we propose an anti-malware system built on customized, sufficiently deep, end-to-end deep learning architectures that detect and attribute Android malware via opcodes extracted from application bytecode. Our results show that bidirectional long short-term memory (BiLSTM) neural networks can detect the static behavior of Android malware and beat state-of-the-art models without using handcrafted features. In our experiments we also work with distinct and independent deep learning models: sequence specialists such as recurrent neural networks, long short-term memory networks and their bidirectional variant, as well as more conventional neural architectures such as fully connected networks, deep convnets, Diabolo networks (autoencoders), and generative graphical models such as deep belief networks, all applied to static malware analysis on Android. To test our system, we have also augmented a bytecode dataset from three open and independently maintained state-of-the-art datasets. Our bytecode dataset, which is large by an order of magnitude, suffices for our experiments. Our results suggest that the proposed system can lead to better design of malware detectors, as we report an accuracy of 0.999 and an F1-score of 0.996 on a large dataset of more than 1.8 million Android applications.

Introduction

Android represents a revolution in smartphones and mobile devices. It is an open-source, virtual-machine-based mobile platform powering countless smartphones, tablets, smart TVs, car entertainment systems, set-top boxes and similar devices. Being open source means that developers have access to the source code, which they can use to implement new features, add innovative attributes, and enhance the user experience.

Freelance developers also leverage Android to tinker and experiment with their ideas and offer a unique software platform, often termed a ROM. Tech-savvy users can then deploy such customized ROMs on their devices if they are not satisfied with the user experience provided by the stock firmware installed by the manufacturer. Lineage OS, Paranoid Android, and Resurrection Remix are examples of such community-driven ROMs. Therefore, while other smartphone platforms let you choose from an array of available applications (apps), Android is unique in that its ecosystem also lets you switch from stock firmware to custom ROMs, a choice analogous to the one offered by different Linux distributions. However, custom ROMs are not always a safe choice [1].

Another unique feature of the Android platform is that applications are available not only from the native Google Play Store but also from a variety of third-party app sources. While apps downloaded and installed from the native app store can reasonably be trusted, the same cannot be said of apps installed from third-party sources [2], [3], [4].

However, the advent of malicious applications that exploit the ease of access provided by the above-mentioned software and hardware creates a nightmare situation for conventional malware detection systems. Android has 3.5 million applications [5] in its ecosystem, and 99% of all malware targets Android [6].

On top of that, the native Google Play Store has 250 apps categorized as antivirus (AV); however, 66% of the samples (170 apps) do not pass the AV criteria. These applications do not successfully identify malicious apps; they simply use whitelists and blacklists to filter out undesirable applications. Most of them merely serve advertising on the Play Store behind a phony interface [7].

In the beginning, diagnosing Android malware was simple, since it exhibited uncomplicated malicious attributes that could be identified merely by observing the tasks it performed. Over time, Android malware has become progressively less predictable and has adopted more aggressive techniques. In response, malware security researchers have proposed a variety of detection mechanisms; see Section 3 for a brief summary.

Problem: The customizability of Android through custom firmware and third-party applications, combined with the sensitivity of the information stored on these devices, necessitates the development and deployment of cutting-edge detection measures to secure the user and the device and to catch the malware. This clearly shows that routine approaches to app defense are not sufficient to contain the ever-growing body of Android malware. Hence, a next generation of malware detection is highly desirable, one that achieves the following objectives:

Objectives:

1. Mirror the human effort of creating and extracting features with an automatic method, offloading the burden to machines that learn progressively over time to carry out the feature engineering process correctly and efficiently.
2. Be less prone to false alarms.
3. Support the large variety of development platforms.

Contribution: As a remedy, in this paper we address Android malware detection and attribution through a proposed anti-malware system of deep learning models that caters to the above-mentioned objectives. Our system uses a tailored variety of available models: those that essentially deal with sequences, such as recurrent neural networks (RNNs) [8] and their variants, long short-term memory (LSTM) [9] and bidirectional long short-term memory (BiLSTM) models; others such as deep autoencoders [10] and deep belief networks (DBNs) [11]; and those that have outperformed many vision benchmarks, such as fully connected neural networks [12] and convolutional neural networks (CNNs) [13].

The design, customization and implementation details of all these models, along with the motivation behind the selection of models and hyperparameters, are given section by section. We show that our custom-trained models can automatically engineer and learn features from a large, unlabeled input space (dataset) for malware detection, similar to tasks in vision and speech recognition [14], [15], [16]. This automatic, end-to-end feature engineering, classification [17] and malware attribution ability of deep models stands in for the human effort of manually analyzing the sample space, creating the features and performing static classification. Note that analyzing malware without executing it is the essence of static analysis, which is especially helpful on low-power and memory-limited Android devices.

For our experiments, we also augment a large dataset of opcodes extracted from Android package kits (APKs) and vectorize them through one-hot encoding. We aim to make this dataset freely available on demand. We also show that our system outperforms the previous state of the art.
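As an illustration of the one-hot vectorization mentioned above, the following is a minimal sketch and not the authors' pipeline. The five-entry vocabulary `VOCAB` and the helper `one_hot` are hypothetical names chosen for this example; the mnemonics are Dalvik opcodes, but the real instruction set is much larger.

```python
from typing import List, Sequence

# Hypothetical, tiny opcode vocabulary for illustration only.
VOCAB = ["nop", "move", "const", "invoke-virtual", "return-void"]
INDEX = {op: i for i, op in enumerate(VOCAB)}


def one_hot(opcodes: Sequence[str]) -> List[List[int]]:
    """Map each opcode to a vector with a single 1 at its vocabulary index."""
    rows = []
    for op in opcodes:
        row = [0] * len(VOCAB)
        row[INDEX[op]] = 1  # raises KeyError for an opcode outside VOCAB
        rows.append(row)
    return rows


# One row per opcode in the sequence, one column per vocabulary entry.
encoded = one_hot(["const", "invoke-virtual", "return-void"])
```

Each application then becomes a matrix of shape (sequence length, vocabulary size), which is the kind of input a sequence model can consume directly.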

Based on our observations and experiments, in which our deep learning system with BiLSTM achieves an accuracy of 0.99 and an F1 score of 0.99, we conclude that our results demonstrate the superiority of our tailored and trained deep models over the conventional machine learning methods used in contemporary approaches to mobile malware detection for Android. The rest of the paper is organized as follows:

Paper structure: Section 2 provides the essential background on malware analysis in the Android domain (Section 2.1) and on deep learning (Section 2.4). Section 3 describes the related work. Section 4 then explains the formal design and training of different deep learning architectures for malicious app analysis on Android. The tools we used are summarized in Section 5 as the experimental setup. The results of our experiments for all models are given in Section 6. Finally, Section 7 concludes our work.

Section snippets

Android malware analysis

Of late, numerous malware detection efforts have targeted VM-based platforms like Android; dynamic analysis and static analysis are the two major directions among these attempts.

Dynamic analysis

The first approach to malware detection is dynamic analysis, which is considered an advanced method of system analysis. In this type of analysis, the system is analyzed at run time; see, for example, [18]. Dynamic analysis examines the behavior of a program. Analysis of the application is

Related work

In this section, we present the work most closely related to our approach, reviewing the approaches needed for comparison and overview. Note that state-of-the-art approaches such as [27], [28], [29], [30], and [31] are discussed in Section 6.1. The rest of the related work is discussed in this section and summarized in Table 1.

ANASTASIA [32] is based on the static analysis of application behavior. The author used Machine Learning classification algorithms, which included ensemble, eXtreme Gradient

Proposed system for malware detection

In this section, we provide the details of our anti-malware system for the Android platform. Our system is based on the deep neural networks introduced in Section 2.4 and discussed below. The system follows the layered architecture depicted in Fig. 1.

input-step1, .dex extraction-step2 and preprocessing-step3: Recall that we have two separate test beds, one deployed on a GPU-based machine and the second on an everyday laptop. Besides, we have used 3 datasets of Android

Experimental setup

Our dataset nomenclature was discussed in Section 4.1. We now briefly outline our experimental setup.

Each application’s data is converted to a one-hot representation of the opcodes extracted directly from the byte-code through static analysis. Note that we evaluated the predictions of our deep learning models by splitting our dataset into training and testing sets with a ratio of 60% and 40% respectively. We also performed standard 10-fold cross validation. First, we
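The 60%/40% split and the 10-fold cross validation described in this snippet can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `split_60_40` and `k_folds`, the fixed seed, and the sample count of 100 are all hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def split_60_40(n: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and cut them into 60% train and 40% test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = (n * 60) // 100  # integer arithmetic avoids float rounding
    return idx[:cut], idx[cut:]


def k_folds(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


# Hypothetical dataset of 100 samples, addressed by index.
train_idx, test_idx = split_60_40(100)
folds = list(k_folds(100))
```

Working on indices rather than on the samples themselves keeps the sketch independent of how the one-hot matrices are stored.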

Results & discussion

In this section, we present the key details of how our learning models were evaluated and scored on our dataset. Afterwards, we return to the performance of the deep learning models we used, compared with the state of the art for classifying Android malware.

As a rule of thumb, the confusion matrix for a binary classification comprises four outcomes, namely: true positives (TP), true negatives (TN), false positives (FP) and false negatives
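For reference, accuracy and F1 score follow from the four confusion-matrix counts by their standard definitions. The sketch below uses hypothetical counts chosen only for illustration; they are not results from the paper, and the function name `scores` is likewise an assumption.

```python
from typing import Dict


def scores(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)  # share of flagged apps that are truly malware
    recall = tp / (tp + fn)     # share of malware that was flagged
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only.
result = scores(tp=90, tn=95, fp=5, fn=10)
```

The function assumes at least one positive prediction and one positive sample; otherwise precision or recall would divide by zero.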

Conclusions and future directions

In this paper, we have investigated static malware detection for Android-based smartphones through the application of deep learning.

We have presented a large-scale byte-code dataset for Android malware analysis and demonstrated the efficacy of models like fully connected neural networks, convolutional neural networks, autoencoders, deep belief networks and recurrent neural networks through architectural and empirical evidence. Our results not only match but in most cases beat

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We extend our gratitude to the NVIDIA Corporation for providing us with a Tesla K40c GPU; the resource-hungry training and testing of our neural networks were only viable with this GPU. This research would not have been possible without freely available datasets. We are thankful to the Android Malware Dataset (AMD) [38], Drebin [27] and the VirusShare site [39] for maintaining these contributions to date.

Muhammad Amin is an Assistant Professor at National University of Computer & Emerging Sciences. His interest is machine learning in large distributed environments. He is currently working towards incorporating machine learning in large-scale malware analysis.

References (56)

Hornik K. et al., Multilayer feedforward networks are universal approximators, Neural Netw. (1989)
Biswas A. et al., Approximate distance fields with non-vanishing gradients, Graph. Models (2004)
Funahashi K.i. et al., Approximation of dynamical systems by continuous time recurrent neural networks, Neural Netw. (1993)
Karbab ElMouatez Billah et al., Automatic framework for android malware detection using deep learning
Feizollah Ali et al., Androdialysis: analysis of android intent effectiveness in malware detection
Nguyen Tan Can, Pham Van Hau, Anh Nguyen Tuan, Detect security threat in android custom firmware by analyzing...
Zhou Wu et al., Detecting repackaged smartphone applications in third-party android marketplaces
Google Android: Annual Android Security Year in Review....
Derr Erik, The Impact of Third-party Code on Android App Security, journal Enigma 2018, Enigma 2018, USENIX...
Number of Android apps. Accessed from:...
Share of Android malware. Accessed from:...
Android AV test malware. Accessed from:...
Yang Z. et al., Neural machine translation with recurrent attention modeling (2016)
Hochreiter S. et al., Long short-term memory, Neural Comput. (1997)
Hinton G.E. et al., Reducing the dimensionality of data with neural networks, Science (2006)
Salakhutdinov R. et al., On the quantitative analysis of deep belief networks
LeCun Y. et al., Convolutional networks for images, speech, and time series
Farabet Clement et al., Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
Krizhevsky Alex et al., Imagenet classification with deep convolutional neural networks
Deng Li et al., Recent advances in deep learning for speech research at microsoft
Szegedy C. et al., Inception-v4, inception-resnet and the impact of residual connections on learning (2016)
S.K. Dash, G. Suarez-Tangil, S. Khan, K. Tam, M. Ahmadi, J. Kinder, L. Cavallaro, Droidscribe: Classifying android...
Suykens J.A. et al., Least squares support vector machine classifiers, Neural Process. Lett. (1999)
V. Nair, G.E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th...
Chung J. et al., Gated feedback recurrent neural networks (2015)
Hinton G., A practical guide to training restricted boltzmann machines, Momentum (2010)
LeCun Y. et al., Deep learning, Nature (2015)
Hinton G.E. et al., The wake-sleep algorithm for unsupervised neural networks, Science (1995)

Tamleek Ali is an Associate Professor at Institute of Management Sciences. His specialty is remote attestation and trust in large distributed environments. He has several research publications in conferences and international journals. He is currently working towards incorporating machine learning in large-scale malware analysis.

Mohammad Tehseen is a Ph.D. scholar pursuing his Ph.D. degree in the Department of Computer Science, University of Peshawar, Pakistan. He has published several papers in reputed journals and conferences. His research interests include Wireless Sensor Networks, Middleware, IoT, Wireless Sensor Based Applications, Deep Learning and Signal Processing.

Murad Khan received the B.S. degree in computer science from the University of Peshawar, Pakistan, in 2008. He completed his Ph.D. degree in computer science and engineering at the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea. Dr. Khan has published over 60 international conference and journal papers along with two book chapters with Springer and CRC Press. He has also served as a TPC member for world-reputed conferences and as a reviewer for numerous journals such as Future Generation Systems (Elsevier), IEEE Access, etc. In 2016, he received the Qualcomm innovation award at Kyungpook National University for designing a Smart Home Control System. He was also awarded a Bronze Medal at ACM SAC 2015, Salamanca, Spain, for his distinguished work on multi-criteria based handover techniques. He is a member of various communities such as ACM, IEEE, CRC Press, etc. His areas of expertise include ad-hoc and wireless networks, architecture design for the Internet of Things, communication protocol design for smart cities and homes, Big Data analytics, etc.

Fakhri Alam Khan is Associate Professor at the Institute of Management Sciences, Peshawar, Pakistan. He received his PhD (with Distinction) in Computer Science in 2010 from the Institute of Scientific Computing, University of Vienna, Austria. His research interests include scientific workflows provenance, energy efficiency in WSN, multimedia technologies, nature inspired meta-heuristic algorithms, and workflow parameters significance measurement. He has published several research papers in various reputed peer-reviewed internationally recognized journals.

Sajid Anwar is an Associate Professor in the Center of Excellence in Information Technology, Institute of Management Sciences (IMSciences), Peshawar, Pakistan. He earned his B.Sc. and M.Sc. degrees in computer science from the University of Peshawar in 1997 and 1999, respectively. He completed his MS degree (Computer Science, 2007) and Ph.D. degree (Software Engineering, 2011) at NUCES-FAST, Islamabad. Currently, he is Head of the Undergraduate Program in Software Engineering at IMSciences.

He has been a Guest Editor of numerous journals, such as Cluster Computing Journal Springer, Grid Computing Journal Springer, Expert Systems Journal Wiley, and Computational and Mathematical Organization Theory Journal Springer, and a reviewer for IEEE Transactions on Evolutionary Computations, Neurocomputing Journal, IEEE Access, Expert Systems, Software: Practice and Experience, IEEE Transactions on Industrial Informatics, International Journal of Information Technology & Decision Making and Telematics and Informatics Journal. He is also a Member of the Board Committee, Institute of Creative Advanced Technologies, Science and Engineering, Korea (iCatse.org) http://icatse.org/.
