Static malware detection and attribution in android byte-code through an end-to-end deep system

https://doi.org/10.1016/j.future.2019.07.070

Highlights

  • Deep learning based Android malware detection system.

  • Augmentation of a large-scale bytecode dataset.

  • End-to-end feature learning.

  • Static detection of malware from a bytecode dataset.

Abstract

Android represents a revolution in handhelds and mobile devices. It is an open source, virtual machine based mobile platform that powers millions of smartphones and other devices, and an even larger number of applications in its ecosystem. In its short lifespan, Android has also seen a colossal expansion in application malware, with 99% of all smartphone malware found in the Android ecosystem. Consequently, a number of techniques have been proposed in the literature for the analysis and detection of malicious applications on the Android platform. The growing and diversifying nature of Android malware has greatly reduced the usefulness of prevailing malware detectors, leaving Android users susceptible to novel malware. As a remedy, in this paper we propose an anti-malware system that uses customized, sufficiently deep learning models: end-to-end deep learning architectures that detect and attribute Android malware via opcodes extracted from application bytecode. Our results show that bidirectional long short-term memory (BiLSTM) neural networks can detect the static behavior of Android malware, beating state-of-the-art models without using handcrafted features. In our experiments, we also work with distinct and independent deep learning models, leveraging sequence specialists such as recurrent neural networks, long short-term memory networks and their bidirectional variant, as well as more usual neural architectures such as fully connected networks, deep convnets, Diabolo networks (autoencoders), and generative graphical models such as deep belief networks, for static malware analysis on Android. To test our system, we have also augmented a bytecode dataset from three open and independently maintained state-of-the-art datasets. The resulting bytecode dataset, large by an order of magnitude, is sufficient for our experiments. Our results suggest that the proposed system can lead to better design of malware detectors, as we report an accuracy of 0.999 and an F1-score of 0.996 on a large dataset of more than 1.8 million Android applications.

Introduction

Android represents a revolution in smartphones and mobile devices. It is an open source, virtual machine based mobile platform powering countless smartphones, tablets, smart TVs, car entertainment systems, set-top boxes, and other devices. Being open source means that developers have access to the source code, which they can use to implement new features, add innovative attributes, and enhance the user experience.

Freelance developers also use Android to tinker and experiment with their ideas and to offer a unique software platform, often termed a ROM. Tech-savvy users can then deploy such customized ROMs on their devices if they are not satisfied with the user experience provided by the stock firmware installed by the manufacturer. Lineage OS, Paranoid Android, and Resurrection Remix are examples of such community-driven ROMs. Therefore, while other smartphone platforms let you choose from an array of applications (apps), Android is unique in the sense that its ecosystem also lets you switch from stock firmware to custom ROMs, a choice analogous to the one offered by different Linux distributions. However, custom ROMs are not always a safe option [1].

Another unique feature of the Android platform is that applications are available not only from the native Google Play Store but also from a variety of third-party app sources. While apps downloaded and installed from the native app store can reasonably be trusted, the same cannot be said of apps installed from third-party sources [2], [3], [4].

However, the advent of malicious applications that exploit the ease of access to the above-mentioned software and hardware creates a nightmare situation for conventional malware detection systems. Android has 3.5 million applications [5] in its ecosystem, and 99% of the total malware is targeted at Android [6].

On top of that, the native Google Play store has 250 apps categorized as antivirus (AV); however, 66% of the samples (170 apps) do not pass the AV criteria. These applications do not successfully distinguish malicious apps. They simply use whitelists and blacklists to sift out undesirable applications. Most of them do little more than advertise on the Play Store with a phony interface [7].

In the beginning, diagnosing Android malware was simple, since it exhibited uncomplicated malicious attributes that could be identified simply by observing the tasks it performed. Over time, Android malware has become progressively more unpredictable and has adopted more aggressive techniques. In response, malware security researchers have proposed a variety of detection mechanisms; Section 3 gives a brief summary.

Problem: The customizability of Android through custom firmware and third-party applications, combined with the sensitivity of the information stored on these devices, necessitates the development and implementation of cutting-edge detection measures to ensure the security of the user and the device and to catch the malware. This clearly shows that routine approaches to app defense are not sufficient to contain the ever-growing volume of Android malware. Hence, a next generation of malware detection is highly desirable, one that can achieve the following objectives:

Objectives:

  • 1. Mirror the human effort of creating and extracting features through an automatic method, offloading the burden to machines that learn progressively over time to carry out the feature engineering process correctly and efficiently.

  • 2. Be less prone to false alarms.

  • 3. Support the large variety of development platforms.

Contribution: As a remedy, in this paper we address the problem of Android malware detection and attribution through a proposed anti-malware system of deep learning models that caters for the above-mentioned objectives. Our system uses a tailored variety of available models. Some essentially deal with sequences, such as recurrent neural networks (RNNs) [8] and their variants, long short-term memory (LSTM) [9] and bidirectional long short-term memory (BiLSTM) models. Others include deep autoencoders [10] and deep belief networks (DBNs) [11], as well as models that have outperformed on many vision benchmarks, such as fully connected neural networks [12] and convolutional neural networks (CNNs) [13].

The details of the design, customization, and implementation of all these models, along with the motivation behind the selection of the models and their hyperparameters, are given section by section. We show that our custom-trained models can automatically engineer and learn features from a large, unlabeled input space (dataset) for malware detection, similar to tasks in vision and speech recognition [14], [15], [16]. This automatic, end-to-end feature engineering, classification [17], and malware attribution ability of deep models substitutes for the human effort of manually analyzing the sample space, creating features, and performing static classification. Note that analyzing malware without executing it is the essence of static analysis, which is especially helpful on low-power and memory-limited Android devices.

For our experiments, we also augment a large dataset of opcodes extracted from Android package kits (APKs) and vectorize them through one-hot encoding. We aim to make this dataset freely available on demand. We also show that our system outperforms the previous state of the art.
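To make the one-hot vectorization step concrete, the following is a minimal sketch and not the authors' implementation. The opcode sequences, the helper names `build_vocabulary` and `one_hot_encode`, and the choice to sort the vocabulary are all assumptions made for illustration; the paper does not specify how the opcode vocabulary is ordered.

```python
from typing import Dict, List, Sequence


def build_vocabulary(sequences: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each distinct opcode to an integer index (sorted for determinism)."""
    opcodes = sorted({op for seq in sequences for op in seq})
    return {op: idx for idx, op in enumerate(opcodes)}


def one_hot_encode(sequence: Sequence[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Return one row per opcode; each row has a single 1 at the opcode's index."""
    rows = []
    for op in sequence:
        row = [0] * len(vocab)
        row[vocab[op]] = 1
        rows.append(row)
    return rows


# Hypothetical opcode sequences, standing in for opcodes extracted from two apps.
apps = [
    ["const/4", "invoke-virtual", "move-result", "return-void"],
    ["const/4", "if-eqz", "invoke-virtual", "return-void"],
]

vocab = build_vocabulary(apps)
encoded = one_hot_encode(apps[0], vocab)

for opcode, row in zip(apps[0], encoded):
    print(f"{opcode:>15} -> {row}")
```

Each application becomes a matrix with one row per opcode and one column per vocabulary entry, which is the kind of input a sequence model can consume directly.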

Based on our observations and experiments, in which our deep learning system with BiLSTM achieves an accuracy of 0.99 and an F1 score of 0.99, we conclude that our facts and figures demonstrate the superiority of our tailored and trained deep models over the conventional machine learning methods used in contemporary approaches to mobile malware detection for Android. The rest of the paper is organized as follows:

Paper structure: Section 2 provides essential background on malware analysis in the Android domain (Section 2.1) and on deep learning (Section 2.4). Section 3 describes the related work. Section 4 explains the formal design and training of different deep learning architectures for malicious app analysis on Android. Section 5 summarizes the tools we used as the experimental setup. Section 6 reports the results of our experiments for all models. Finally, Section 7 concludes our work.

Section snippets

Android malware analysis

Recently, numerous malware detection efforts have been made for VM-based platforms like Android; dynamic analysis and static analysis are the two major directions among these attempts.

Dynamic analysis

The first approach to malware detection is dynamic analysis. Dynamic analysis is considered an advanced method of system analysis. In this type of analysis, the system is analyzed at run time, for example [18]. Dynamic analysis examines the behavior of a program. Analysis of the application is

Related work

In this section, we present the work most closely related to our approach. We review the necessary approaches for the sake of comparison and overview. Note that state-of-the-art approaches such as [27], [28], [29], [30], and [31] are discussed in Section 6.1. The rest of the related work is discussed in this section and summarized in Table 1.

ANASTASIA [32] is based on the static analysis of application behavior. The author used Machine Learning classification algorithms, which included ensemble, eXtreme Gradient

Proposed system for malware detection

In this section, we provide the details of our anti-malware system for the Android platform. Our system is based on the deep neural networks introduced in Section 2.4 and discussed further below. The system is devised as a layered architecture, depicted in Fig. 1.

input-step1, .dex extraction-step2 and preprocessing-step3: Recall that we have two separate test beds, one deployed on a GPU-based machine and the second on an everyday laptop. Besides, we have used 3 datasets of Android

Experimental setup

Our dataset nomenclature was discussed in Section 4.1. We now briefly outline our experimental setup.

Each application’s data is converted to a one-hot representation of the opcodes extracted directly from the bytecode through static analysis. Note that we evaluated the predictions of our deep learning models by splitting our dataset into training and testing sets with a ratio of 60% and 40%, respectively. We also performed standard 10-fold cross-validation. First, we
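The evaluation protocol mentioned in this snippet (a 60%/40% train/test split plus 10-fold cross-validation) can be sketched as follows. This is an illustration under stated assumptions and not the authors' code: the sample count of 100, the seeds, random (non-stratified) shuffling, and the helper names `train_test_split` and `k_fold_indices` are all hypothetical.

```python
import random
from typing import List, Tuple


def train_test_split(n_samples: int, train_ratio: float = 0.6,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and cut them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_ratio))
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, held_out) index pairs; each sample is held out once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    splits = []
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, held_out))
    return splits


# Assumed dataset size for the illustration only.
train_idx, test_idx = train_test_split(100)
splits = k_fold_indices(100, k=10)

print(len(train_idx), len(test_idx))          # sizes of the 60/40 split
print(len(splits), len(splits[0][1]))         # number of folds, size of one fold
```

In each cross-validation round a model would be trained on the `training` indices and scored on the `held_out` indices, with the per-fold scores averaged at the end.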

Results & discussion

In this section, we present the main points of how our learning models have been evaluated and scored on our dataset. Afterwards, we return to the performance of the deep learning models we used, compared with the state of the art for classifying Android malware.

As a rule of thumb, the confusion matrix for binary classification is a collection of four classes, namely true positives (TP), true negatives (TN), false positives (FP) and false negatives
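The accuracy and F1 score reported in the paper are standard functions of the four confusion-matrix counts. The sketch below uses the usual textbook definitions; the label convention (1 for malware, 0 for benign), the toy label vectors, and the function names are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count TP, TN, FP, FN, assuming 1 = malware (positive) and 0 = benign."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
print(accuracy(tp, tn, fp, fn), f1_score(tp, fp, fn))
```

Because the F1 score ignores true negatives, it is the more informative of the two numbers when benign apps greatly outnumber malicious ones.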

Conclusions and future directions

In this paper, we have investigated static malware detection for Android-based smartphones through the application of deep learning.

We have shown the nomenclature of a large-scale bytecode dataset for Android malware analysis and applied and shared the efficacy of models such as fully connected neural networks, convolutional neural networks, autoencoders, deep belief networks, and recurrent neural networks through architectural and empirical evidence. Our results not only match but in most cases beat

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We extend our gratitude to the NVIDIA Corporation for providing us with a Tesla K40c GPU; the resource-hungry training and testing of neural networks were made viable only by this GPU. This research would not have been possible without the freely available datasets. We are thankful to the Android Malware Dataset (AMD) [38], Drebin [27], and the VirusShare site [39] for maintaining these contributions to date.

Muhammad Amin is an Assistant Professor at the National University of Computer & Emerging Sciences. His interest is machine learning in large distributed environments. He is currently working towards incorporating machine learning in large-scale malware analysis.

References (56)

  • Share of Android malware. Accessed from:...
  • Android AV test malware. Accessed from:...
  • Yang Z. et al. Neural machine translation with recurrent attention modeling (2016)
  • Hochreiter S. et al. Long short-term memory. Neural Comput. (1997)
  • Hinton G.E. et al. Reducing the dimensionality of data with neural networks. Science (2006)
  • Salakhutdinov R. et al. On the quantitative analysis of deep belief networks
  • LeCun Y. et al. Convolutional networks for images, speech, and time series
  • Farabet C. et al. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • Krizhevsky A. et al. ImageNet classification with deep convolutional neural networks
  • Deng L. et al. Recent advances in deep learning for speech research at Microsoft
  • Szegedy C. et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning (2016)
  • S.K. Dash, G. Suarez-Tangil, S. Khan, K. Tam, M. Ahmadi, J. Kinder, L. Cavallaro, Droidscribe: Classifying android...
  • Suykens J.A. et al. Least squares support vector machine classifiers. Neural Process. Lett. (1999)
  • V. Nair, G.E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th...
  • Chung J. et al. Gated feedback recurrent neural networks (2015)
  • Hinton G. A practical guide to training restricted Boltzmann machines. Momentum (2010)
  • LeCun Y. et al. Deep learning. Nature (2015)
  • Hinton G.E. et al. The wake-sleep algorithm for unsupervised neural networks. Science (1995)
Tamleek Ali is an Associate Professor at the Institute of Management Sciences. His specialty is remote attestation and trust in large distributed environments. He has several research publications in conferences and international journals. He is currently working towards incorporating machine learning in large-scale malware analysis.

Mohammad Tehseen is a Ph.D. scholar pursuing his Ph.D. degree in the Department of Computer Science, University of Peshawar, Pakistan. He has published several papers in reputed journals and conferences. His research interests include wireless sensor networks, middleware, IoT, wireless sensor based applications, deep learning, and signal processing.

Murad Khan received the B.S. degree in computer science from the University of Peshawar, Pakistan, in 2008. He completed his Ph.D. degree in computer science and engineering at the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea. Dr. Khan has published over 60 international conference and journal papers, along with two book chapters with Springer and CRC Press. He has also served as a TPC member for world-reputed conferences and as a reviewer for numerous journals such as Future Generation Systems (Elsevier), IEEE Access, etc. In 2016, he received the Qualcomm Innovation Award at Kyungpook National University for designing a Smart Home Control System. He was also awarded a Bronze Medal at ACM SAC 2015, Salamanca, Spain, for his distinguished work on multi-criteria based handover techniques. He is a member of various communities such as ACM, IEEE, and CRC Press. His areas of expertise include ad-hoc and wireless networks, architecture design for the Internet of Things, communication protocol design for smart cities and homes, and big data analytics.

Fakhri Alam Khan is an Associate Professor at the Institute of Management Sciences, Peshawar, Pakistan. He received his PhD (with Distinction) in Computer Science in 2010 from the Institute of Scientific Computing, University of Vienna, Austria. His research interests include scientific workflow provenance, energy efficiency in WSN, multimedia technologies, nature-inspired meta-heuristic algorithms, and workflow parameter significance measurement. He has published several research papers in various reputed, peer-reviewed, internationally recognized journals.

Sajid Anwar is an Associate Professor in the Center of Excellence in Information Technology, Institute of Management Sciences (IMSciences), Peshawar, Pakistan. He earned his B.Sc. and M.Sc. degrees in computer science from the University of Peshawar in 1997 and 1999, respectively. He completed his MS degree (Computer Science, 2007) and Ph.D. degree (Software Engineering, 2011) at NUCES-FAST, Islamabad. Currently, he is Head of the Undergraduate Program in Software Engineering at IMSciences.

He has been a Guest Editor of numerous journals, such as Cluster Computing Journal Springer, Grid Computing Journal Springer, Expert Systems Journal Wiley, and Computational and Mathematical Organization Theory Journal Springer, and a reviewer for IEEE Transactions on Evolutionary Computations, Neurocomputing Journal, IEEE Access, Expert Systems, Software: Practice and Experience, IEEE Transactions on Industrial Informatics, International Journal of Information Technology & Decision Making, and Telematics and Informatics Journal. He is also a Member of the Board Committee of the Institute of Creative Advanced Technologies, Science and Engineering, Korea (iCatse.org) http://icatse.org/.
