Abstract
The Drebin dataset (in: NDSS, 2014) is the most widely distributed academic dataset of Android malware, and therefore the most used dataset in research papers on Android malware detection. The research community uses it to evaluate and compare algorithms. We discovered that 49.35% of the samples in this dataset have at least one other sample that is a repackaged version containing exactly the same sequence of opcodes. In all cases, the only differences between the original malware and its duplicates are the embedded resources and some strings in the code. To assess the performance of malware detectors or classifiers, part of the dataset is held out as a testing set. As a result, a major part of the testing set ends up being the same samples that were used in the training set. This situation can lead us, the research community, to overrate the performance of the algorithms we design. In the worst case, it leads us to wrong conclusions and wrong directions for future research. We then conduct an experiment in which we test several classification algorithms on the Drebin dataset with and without the duplicates. Our results show that, depending on the classifier, using the full dataset leads to inaccuracy that is moderately (124%) to strongly (172%) underrated, and that the order of performance of the algorithms is modified. Finally, we provide the list of unique malware samples from the Drebin dataset, available on GitHub.
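The duplicate criterion described in the abstract (two samples share exactly the same opcode sequence, regardless of resources and strings) can be illustrated with a minimal sketch. This is not the authors' tooling: it assumes the opcode sequence of each sample has already been extracted by some disassembler, and the function names (`opcode_fingerprint`, `group_by_opcodes`, `duplicated_fraction`, `unique_representatives`) and the toy sample names are hypothetical.

```python
import hashlib
from collections import defaultdict


def opcode_fingerprint(opcodes):
    """Hash an opcode sequence. Only opcode mnemonics are hashed, so
    differences in strings or embedded resources do not affect the result."""
    h = hashlib.sha256()
    for op in opcodes:
        h.update(op.encode("utf-8"))
        h.update(b"\x00")  # separator so ["a", "b"] differs from ["ab"]
    return h.hexdigest()


def group_by_opcodes(samples):
    """Group sample names that share exactly the same opcode sequence."""
    groups = defaultdict(list)
    for name, opcodes in samples.items():
        groups[opcode_fingerprint(opcodes)].append(name)
    return list(groups.values())


def duplicated_fraction(groups):
    """Fraction of samples that have at least one opcode-identical twin."""
    total = sum(len(g) for g in groups)
    duplicated = sum(len(g) for g in groups if len(g) > 1)
    return duplicated / total if total else 0.0


def unique_representatives(groups):
    """Keep one sample per group (here: the alphabetically first name)."""
    return sorted(min(g) for g in groups)


# Toy data (hypothetical): a.apk and b.apk are repackaged twins.
samples = {
    "a.apk": ["const/4", "invoke-virtual", "return-void"],
    "b.apk": ["const/4", "invoke-virtual", "return-void"],
    "c.apk": ["const/4", "return-void"],
    "d.apk": ["invoke-static", "return-void"],
}
groups = group_by_opcodes(samples)
print(duplicated_fraction(groups))
print(unique_representatives(groups))
```

Splitting into training and testing sets only after reducing the data to `unique_representatives(groups)` is one way to avoid opcode-identical samples appearing on both sides of the split.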
Notes
URL removed to preserve anonymity. The document is provided as supplementary material.
References
Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C.: Drebin: effective and explainable detection of android malware in your pocket. In: NDSS (2014)
Bell, C.: Mutual information and maximal correlation as measures of dependence. Ann. Math. Stat. pp. 587–595 (1962)
Dimjašević, M., Atzeni, S., Ugrina, I., Rakamaric, Z.: Evaluation of android malware detection based on system calls. In: Proceedings of the 2016 ACM on International Workshop on Security and Privacy Analytics, pp. 1–8. ACM (2016)
Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: Droidkin: lightweight detection of android apps similarity. In: International Conference on Security and Privacy in Communication Systems, pp. 436–453. Springer (2014)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaei, Y., Trickel, E., Zhao, Z., Doupe, A., et al.: Deep android malware detection. In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pp. 301–308. ACM (2017)
Cite this article
Irolla, P., Dey, A. The duplication issue within the Drebin dataset. J Comput Virol Hack Tech 14, 245–249 (2018). https://doi.org/10.1007/s11416-018-0316-z
DOI: https://doi.org/10.1007/s11416-018-0316-z