Abstract
Binary code analysis is the foundation for research in vulnerability discovery, software protection, and malicious code analysis. However, analyzing binary files is challenging because high-level semantic information is absent, which leads to heavy dependence on analysts’ expertise and significantly reduces the efficiency of binary code analysis. Recent years have witnessed a blossoming of machine learning models for binary analysis, yet little research addresses the problem of binary code datasets. In this paper, we review the existing, available datasets and classify them according to their applications. We set up experiments that illustrate how dataset quality affects the performance of machine learning models in binary function recognition. Based on the experimental evaluation, we discuss the ground-truth and quality-evaluation problems of binary code datasets.
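To make the ground-truth issue concrete, the minimal sketch below (ours, not part of the paper) shows how function-boundary labels are commonly harvested when building function-recognition datasets: by reading the start address and size of every STT_FUNC symbol from an unstripped ELF binary. It assumes the pyelftools library is installed; zero-size symbols, compiler-generated thunks, and hand-written assembly without symbols already hint at why such labels can be incomplete or noisy.

```python
# Minimal sketch (illustrative, not from the paper): derive function-boundary
# ground truth from the symbol table of an unstripped ELF binary.
# Assumes pyelftools is available (pip install pyelftools).
import sys
from elftools.elf.elffile import ELFFile


def function_ground_truth(path):
    """Return (start_address, size, name) for every defined function symbol."""
    with open(path, "rb") as f:
        elf = ELFFile(f)
        symtab = elf.get_section_by_name(".symtab")
        if symtab is None:
            # Stripped binary: symbol-based ground truth is simply unavailable.
            return []
        funcs = [
            (sym["st_value"], sym["st_size"], sym.name)
            for sym in symtab.iter_symbols()
            if sym["st_info"]["type"] == "STT_FUNC" and sym["st_size"] > 0
        ]
        return sorted(funcs)


if __name__ == "__main__":
    for start, size, name in function_ground_truth(sys.argv[1]):
        print(f"{start:#010x} {size:6d} {name}")
```

Running the script on the same program built with and without debug symbols or stripping makes the dependence of such labels on build settings, and hence on dataset construction choices, immediately visible.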