SemiDroid: a behavioral malware detector based on unsupervised machine learning techniques using feature selection approaches

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

With the exponential growth in Android apps, Android-based devices are becoming victims of targeted attackers in the “silent battle” of cybernetics. Protecting Android-based devices from malware has become more complex and crucial for academicians and researchers. The main vulnerability lies in the underlying permission model of Android apps, which demand permissions or permission sets at the time of their installation. In this study, we consider permissions and API calls as features that help in developing a model for malware detection. To select appropriate features or feature sets from thirty different categories of Android apps, we implemented ten distinct feature selection approaches. Using the selected feature sets, we developed distinct models with five different unsupervised machine learning algorithms. We conducted an experiment on 500,000 distinct Android apps belonging to thirty distinct categories. Empirical results reveal that the model built with rough set analysis as the feature selection approach and farthest first as the machine learning algorithm achieved the highest detection rate, 98.8%, in detecting malware from real-world apps.
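To make the clustering step named in the abstract concrete, the following is a minimal sketch of farthest-first traversal over binary permission/API-call vectors. It is not the authors' implementation: the function names (`euclidean`, `farthest_first`), the choice of the first app as the initial center, the four feature columns, and the toy data are all assumptions made for illustration, and it assumes `k` does not exceed the number of apps.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k, start=0):
    """Pick k centers by farthest-first traversal, then assign each point
    to its nearest center. Returns (center indices, cluster labels)."""
    centers = [start]
    # Distance from every point to its nearest already-chosen center.
    nearest = [euclidean(p, points[start]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], euclidean(p, points[nxt]))
    labels = [
        min(range(k), key=lambda c: euclidean(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels


# Hypothetical toy data: one row per app, one column per selected feature
# (INTERNET, SEND_SMS, READ_CONTACTS, sendTextMessage API call).
apps = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 0],
]

centers, labels = farthest_first(apps, k=2)
print(centers)  # indices of the apps chosen as cluster centers
print(labels)   # cluster index assigned to each app
```

In this sketch the clusters carry no benign/malware meaning by themselves; in an unsupervised setting, a cluster would still have to be mapped to a class afterwards, for example by inspecting a few labeled apps in each cluster.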

Notes

  1. https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/.

  2. https://www.appbrain.com/stats.

  3. https://www.businessofapps.com/data/app-statistics/#1.

  4. https://en.wikipedia.org/wiki/Google_Play.

  5. http://blog.trendmicro.com/trendlabs-security-intelligence/a-look-at-google-bouncer/.

  6. https://play.google.com/store?hl=en.

  7. https://source.android.com/security/reports/Google_Android_Security_2017_Report_Final.pdf.

  8. https://www.gdatasoftware.com/news/g-data-mobile-malware-report-2019-new-high-for-malicious-android-apps.

  9. https://developer.android.com/training/permissions/requesting.html.

  10. Mahindru, Arvind (2020), “Android permissions dataset, Android Malware and benign Application Data set (consist of permissions and API calls)”, Mendeley Data, V3, doi: 10.17632/b4mxg7ydb7.3.

  11. Testing was performed on a local system.

  12. The live location of the user is shown on Google Maps, which is pre-installed on Android-based devices.

  13. Mahindru, Arvind (2020), “Android permissions dataset, Android Malware and benign Application Data set (consist of permissions and API calls)”, Mendeley Data, V3, doi: http://dx.doi.org/10.17632/b4mxg7ydb7.3

  14. https://play.google.com/store?hl=en.

  15. http://apk.hiapk.com/.

  16. http://www.appchina.com/.

  17. http://android.d.cn/.

  18. http://www.mumayi.com/.

  19. http://apk.gfan.com/.

  20. http://slideme.org/.

  21. http://download.pandaapp.com/?app=soft&controller=android#.V-p3f4h97IU.

  22. https://www.virustotal.com/.

  23. https://www.microsoft.com/en-in/windows/comprehensive-security.

  24. http://202.117.54.231:8080/.

  25. Malware families are identified by VirusTotal.

  26. https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile-devices-with-android-os/.

  27. https://data.mendeley.com/datasets/9b45k4hkdf/1.

  28. https://github.com/ArvindMahindru66/Computer-and-security-dataset.

  29. https://developer.android.com/guide/topics/permissions/overview.

  30. https://towardsdatascience.com/self-organizing-maps-ff5853a118d4.

  31. https://en.wikipedia.org/wiki/K-means_clustering.

  32. https://en.wikipedia.org/wiki/Farthest-first_traversal.

  33. A data set DS is composed of a set (O) of n objects described by a set (SA) of l attributes.

  34. \(grid\_list\) consists of attributes.

  35. https://en.wikipedia.org/wiki/Euclidean_distance.

Author information

Corresponding author

Correspondence to Arvind Mahindru.

About this article

Cite this article

Mahindru, A., Sangal, A.L. SemiDroid: a behavioral malware detector based on unsupervised machine learning techniques using feature selection approaches. Int. J. Mach. Learn. & Cyber. 12, 1369–1411 (2021). https://doi.org/10.1007/s13042-020-01238-9

  • DOI: https://doi.org/10.1007/s13042-020-01238-9
