Deep neural architectures for large scale android malware analysis

Cluster Computing

Abstract

Android is arguably the most widely used mobile operating system in the world. Due to its widespread use and huge user base, it has attracted a lot of attention from the unsavory crowd of malware writers. Traditionally, techniques to counter such malicious software involved manually analyzing code to determine whether it was malicious or benign. However, given the pace at which new malware families are surfacing, such an approach is no longer feasible. Machine learning offers a way to tackle this issue of speed by automating the classification task. While several efforts have been made to apply traditional machine learning techniques to Android malware detection, no reasonable effort has been made to utilize the newer deep learning models in this domain. In this paper, we apply several deep learning models, including fully connected, convolutional and recurrent neural networks as well as autoencoders and deep belief networks, to detect Android malware in a large-scale dataset of more than 55 GB of Android malware. Further, we apply Bayesian machine learning to this problem domain to see how it fares against the deep learning based models, while also providing insights into the dataset. We show that these models achieve better results than the state-of-the-art approaches. Our best model achieves an F1 score of 0.986 with an AUC of 0.983, compared to the existing best F1 score of 0.875 and AUC of 0.953.
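The two metrics reported above, F1 score and AUC, can be illustrated with a minimal sketch. It is not the evaluation code used in the paper: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, with label 1 standing for malware and 0 for benign.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in the sample count, so it is only
    suitable for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]           # assumed classifier outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"F1 = {f1:.3f}, AUC = {auc:.3f}")
```

F1 depends on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is one reason the two are commonly reported together.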

Notes

  1. Not to be confused with the basic, well-known machine learning model of Naïve Bayes.

References

  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). http://tensorflow.org/

  2. Arp, D., Spreitzenbarth, M., Hübner, M., Gascon, H., Rieck, K.: Drebin: effective and explainable detection of android malware in your pocket. In: Proceedings of the Annual Symposium on Network and Distributed System Security (NDSS) (2014)

  3. Arzt, S., Rasthofer, S., Fritz, C., Bodden, E., Bartel, A., Klein, J., Le Traon, Y., Octeau, D., McDaniel, P.: FlowDroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps. In: ACM SIGPLAN Notices, vol. 49, pp. 259–269, ACM (2014)

  4. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University Press, Cambridge (2012)

  5. Barrera, D., Van Oorschot, P.: Secure software installation on smartphones. Secur. Priv. IEEE 9(3), 42–48 (2011)

  6. Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., Bengio, Y.: Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590 (2012)

  7. Bedini, A.: HDF5 for Python. http://www.h5py.org

  8. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)

  9. Biswas, A., Shapiro, V.: Approximate distance fields with non-vanishing gradients. Graph. Models 66(3), 133–159 (2004)

  10. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)

  11. Box, G.E., Tiao, G.C.: Bayesian Inference in Statistical Analysis, vol. 40. Wiley, New York (2011)

  12. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062 (2014)

  13. Chollet, F.: Keras: deep learning library for Theano and TensorFlow (2015)

  14. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Gated feedback recurrent neural networks. arXiv preprint arXiv:1502.02367 (2015)

  15. Dash, S.K., Suarez-Tangil, G., Khan, S., Tam, K., Ahmadi, M., Kinder, J., Cavallaro, L.: Droidscribe: classifying android malware based on runtime behavior. Mob. Secur. Technol. (MoST 2016) 7148, 1–12 (2016)

  16. Date, P., Hendler, J.A., Carothers, C.D.: Design index for deep neural networks. Proc. Comput. Sci. 88, 131–138 (2016)

  17. Davis, B., Chen, H.: RetroSkeleton: retrofitting android apps. In: Proceedings of the 11th International Conference on Mobile Systems, Applications and Services (MobiSys’13), pp. 25–28 (2013)

  18. Enck, W., Gilbert, P., Chun, B.G., Cox, L.P., Jung, J., McDaniel, P., Sheth, A.N.: TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10), pp. 1–6 (2010)

  19. Enck, W., Ongtang, M., McDaniel, P.: On lightweight mobile phone application certification. In: Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS’09), pp. 235–245. ACM (2009)

  20. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P., Bengio, S.: Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11, 625–660 (2010)

  21. Franzke, B., Kosko, B.: Using noise to speed up markov chain monte carlo estimation. Proc. Comput. Sci. 53, 113–120 (2015)

  22. Fuchs, A., Chaudhuri, A., Foster, J.: SCanDroid: automated security certification of Android applications. Technical reports (2009)

  23. Funahashi, K.I., Nakamura, Y.: Approximation of dynamical systems by continuous time recurrent neural networks. Neural Netw. 6(6), 801–806 (1993)

  24. Garcia, J., Hammad, M., Pedrood, B., Bagheri-Khaligh, A., Malek, S.: Obfuscation-resilient, efficient, and accurate detection and family identification of android malware. George Mason University, Technical reports (2015)

  25. GData: Mobile malware report: Q2/2015. https://public.gdatasoftware.com/Presse/Publikationen/Malware_Reports/G_DATA_MobileMWR_Q2_2015_EN.pdf. Accessed 15 July 2016

  26. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis, vol. 2. Taylor & Francis, New York (2014)

  27. Hastings, W.K.: Monte carlo sampling methods using markov chains and their applications. Biometrika 57(1), 97–109 (1970)

  28. Hernández-Lobato, J.M., Adams, R.P.: Probabilistic backpropagation for scalable learning of bayesian neural networks. arXiv preprint arXiv:1502.05336 (2015)

  29. Hinton, G.: A practical guide to training restricted boltzmann machines. Momentum 9(1), 926 (2010)

  30. Hinton, G.E., Dayan, P., Frey, B.J., Neal, R.M.: The wake-sleep algorithm for unsupervised neural networks. Science 268(5214), 1158 (1995)

  31. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  32. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  33. Hoffman, M.D., Gelman, A.: The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(1), 1593–1623 (2014)

  34. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989)

  35. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613. ACM (1998)

  36. Jolliffe, I.: Principal Component Analysis. Wiley Online Library (2002)

  37. Karakida, R., Okada, M., Amari, S.I.: Dynamical analysis of contrastive divergence learning. Neural Netw. 79, 78–87 (2016)

  38. Kohonen, T.: Self-Organizing Maps, vol. 30. Springer, New York (2001)

  39. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. The Handb. Brain Theory Neural Netw. 3361(10), 1995 (1995)

  40. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  41. Long, M., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. arXiv preprint arXiv:1605.06636 (2016)

  42. Mansfield-Devine, S.: Android architecture: attacking the weak points. Netw. Secur. 2012(10), 5–12 (2012)

  43. Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in Neural Information Processing Systems, pp. 2204–2212 (2014)

  44. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)

  45. Ongtang, M., McLaughlin, S., Enck, W., McDaniel, P.: Semantically rich application-centric security in android. In: Proceedings of the Annual Computer Security Applications Conference (ACSAC’09), pp. 340–349. IEEE (2009)

  46. Patil, A., Huard, D., Fonnesbeck, C.J.: PyMC: Bayesian stochastic modelling in python. J. Stat. Softw. 35(4), 1 (2010)

  47. Peng, H., Gates, C., Sarma, B., Li, N., Qi, Y., Potharaju, R., Nita-Rotaru, C., Molloy, I.: Using probabilistic generative models for ranking risks of android apps. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 241–252. ACM (2012)

  48. Powell, M.J.: A fast algorithm for nonlinearly constrained optimization calculations. In: Numerical analysis, pp. 144–157. Springer (1978)

  49. Salakhutdinov, R., Murray, I.: On the quantitative analysis of deep belief networks. In: Proceedings of the 25th International Conference on Machine Learning, pp. 872–879. ACM (2008)

  50. Sarma, B.P., Li, N., Gates, C., Potharaju, R., Nita-Rotaru, C., Molloy, I.: Android permissions: a perspective combining risks and benefits. In: Proceedings of the 17th ACM Symposium on Access Control Models and Technologies, pp. 13–22. ACM (2012)

  51. Sermanet, P., Frome, A., Real, E.: Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054 (2014)

  52. Shabtai, A., Fledel, Y., Elovici, Y.: Securing android-powered mobile devices using selinux. Secur. Priv. IEEE 8(3), 36–44 (2010)

  53. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999)

  54. Symantec: Internet security threat report, volume 20. https://www.symantec.com/security_response/publications/threatreport.jsp. Accessed 15 July 2016

  55. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016)

  56. Tripp, O., Rubin, J.: A bayesian approach to privacy enforcement in smartphones. In: USENIX Security (2014)

  57. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM (2008)

  58. VXShare: VirusShare. https://virusshare.com. Accessed 3 Jan 2017

  59. Yan, L.K., Yin, H.: DroidScope: Seamlessly reconstructing the os and dalvik semantic views for dynamic android malware analysis. In: USENIX security symposium, pp. 569–584 (2012)

  60. Yang, Z., Hu, Z., Deng, Y., Dyer, C., Smola, A.: Neural machine translation with recurrent attention modeling. arXiv preprint arXiv:1607.05108 (2016)

  61. Zhou, Y., Jiang, X.: Dissecting android malware: Characterization and evolution. In: Security and Privacy (SP), 2012 IEEE Symposium on, pp. 95–109. IEEE (2012)

Acknowledgements

We would like to thank the maintainers of Drebin [2] and the VirusShare site [58] for making their datasets available to us. The computation-intensive MCMC sampling and neural network training were made possible by the generous contribution of the Tesla K40c GPU by NVIDIA Corporation. The content of this paper is not necessarily endorsed by any of the funding agencies.

Author information

Corresponding author

Correspondence to Mohammad Nauman.

About this article

Cite this article

Nauman, M., Tanveer, T.A., Khan, S. et al. Deep neural architectures for large scale android malware analysis. Cluster Comput 21, 569–588 (2018). https://doi.org/10.1007/s10586-017-0944-y

  • DOI: https://doi.org/10.1007/s10586-017-0944-y
