The duplication issue within the Drebin dataset

  • Correspondence
Journal of Computer Virology and Hacking Techniques

Abstract

The Drebin dataset (NDSS, 2014) is the most widely distributed academic dataset of Android malware, and consequently the most used dataset in research papers on Android malware detection. The research community relies on it to evaluate and compare algorithms. We discovered that 49.35% of the samples in this dataset have at least one other sample that is a repackaged version containing exactly the same sequence of opcodes. In all cases, the only differences between the original malware and its duplicates are the embedded resources and some strings in the code. When the performance of malware detectors or classifiers is assessed, part of the dataset is set aside for testing. As a result, a major part of the testing set ends up being the same samples that were used in the training set. This situation can lead us, the research community, to overrate the performance of the algorithms we design. In the worst case, it leads to wrong conclusions and wrong directions for future research. We then conduct an experiment in which we test several classification algorithms on the Drebin dataset with and without the duplicates. Our results show that, depending on the classifier, the full dataset can lead to moderately (124%) to strongly (172%) underrated inaccuracy, and that the ranking of the algorithms by performance changes. Finally, we provide the list of unique malware samples from the Drebin dataset, available on GitHub.
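
The duplicate criterion described in the abstract (two samples sharing exactly the same opcode sequence) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tooling: the sample names, the toy opcode lists and all helper functions are hypothetical, and the opcode sequences are assumed to have been extracted from the APKs beforehand.

```python
import hashlib
from collections import defaultdict


def opcode_fingerprint(opcodes):
    """Hash an ordered opcode sequence (hypothetical helper).

    Only opcode mnemonics are hashed, so samples that differ solely in
    resources or string constants get the same fingerprint.
    """
    digest = hashlib.sha256()
    for op in opcodes:
        digest.update(op.encode("utf-8"))
        digest.update(b"\x00")  # separator keeps opcode boundaries unambiguous
    return digest.hexdigest()


def group_by_opcodes(samples):
    """Group sample names whose opcode sequences are identical."""
    groups = defaultdict(list)
    for name, opcodes in samples.items():
        groups[opcode_fingerprint(opcodes)].append(name)
    return list(groups.values())


def duplicated_fraction(groups):
    """Fraction of samples that have at least one opcode-identical twin."""
    total = sum(len(g) for g in groups)
    duplicated = sum(len(g) for g in groups if len(g) > 1)
    return duplicated / total if total else 0.0


def unique_representatives(groups):
    """Keep one sample per group (here: the alphabetically first name)."""
    return sorted(min(g) for g in groups)


# Toy data, invented for illustration only.
samples = {
    "a.apk": ["const/4", "invoke-virtual", "return-void"],
    "b.apk": ["const/4", "invoke-virtual", "return-void"],  # repackaged twin of a.apk
    "c.apk": ["const/4", "return-void"],
    "d.apk": ["invoke-virtual", "const/4", "return-void"],  # same opcodes, different order
}
groups = group_by_opcodes(samples)
```

Splitting into training and testing sets only after keeping one representative per group prevents an opcode-identical sample from appearing on both sides of the split, which is the leakage the abstract warns about.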

Notes

  1. https://f-droid.org/.

  2. https://source.android.com/devices/tech/dalvik/index.html.

  3. https://github.com/androguard/androguard.

  4. https://github.com/adrianlopezroche/fdupes.

  5. https://www.gnu.org/software/diffutils/.

  6. URL removed to preserve anonymity. The document is provided as supplementary material.

  7. http://www.cs.waikato.ac.nz/ml/weka/.

  8. URL removed to preserve anonymity. The document is provided as supplementary material.

References

  1. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C.: Drebin: effective and explainable detection of android malware in your pocket. In: NDSS (2014)

  2. Bell, C.: Mutual information and maximal correlation as measures of dependence. Ann. Math. Stat. pp. 587–595 (1962)

  3. Dimjašević, M., Atzeni, S., Ugrina, I., Rakamaric, Z.: Evaluation of android malware detection based on system calls. In: Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics, pp. 1–8. ACM (2016)

  4. Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: Droidkin: Lightweight detection of android apps similarity. In: International Conference on Security and Privacy in Communication Systems, pp. 436–453. Springer (2014)

  5. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  6. McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaei, Y., Trickel, E., Zhao, Z., Doupe, A., et al.: Deep android malware detection. In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pp. 301–308. ACM (2017)

Author information

Corresponding author

Correspondence to Paul Irolla.

Additional information

https://github.com/paul-irolla/drebin-nodup/tree/master.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (txt 192 KB)

About this article

Cite this article

Irolla, P., Dey, A. The duplication issue within the Drebin dataset. J Comput Virol Hack Tech 14, 245–249 (2018). https://doi.org/10.1007/s11416-018-0316-z

  • DOI: https://doi.org/10.1007/s11416-018-0316-z
