A Review on Binary Code Analysis Datasets

  • Conference paper
Wireless Artificial Intelligent Computing Systems and Applications (WASA 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14999)

Abstract

Binary code analysis serves as the foundation for research in vulnerability discovery, software protection, and malicious code analysis. However, analyzing binary files is challenging because they lack high-level semantic information, which makes the work heavily dependent on analysts’ expertise and significantly reduces its efficiency. Recent years have witnessed a boom in machine learning models for binary analysis, but little research addresses the problem of binary code datasets. In this paper, we review all existing, available datasets and classify them according to their application. We set up experiments to illustrate how dataset quality can affect the performance of machine learning models in binary function recognition. Based on the experimental evaluation, we discuss the ground truth and quality evaluation problems for binary code datasets.

Author information

Corresponding author

Correspondence to Zhijian Huang.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Huang, Z., Song, S., Liu, H., Kuang, H., Zhang, J., Hu, P. (2025). A Review on Binary Code Analysis Datasets. In: Cai, Z., Takabi, D., Guo, S., Zou, Y. (eds) Wireless Artificial Intelligent Computing Systems and Applications. WASA 2024. Lecture Notes in Computer Science, vol 14999. Springer, Cham. https://doi.org/10.1007/978-3-031-71470-2_17

  • DOI: https://doi.org/10.1007/978-3-031-71470-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71469-6

  • Online ISBN: 978-3-031-71470-2

  • eBook Packages: Computer Science, Computer Science (R0)
