DOI: 10.1145/3488933.3489033
research-article

Functionality Recognition on Binary Code with Neural Representation Learning

Published: 25 February 2022

Abstract

Functionality recognition of binary code has important applications in malware analysis, software forensics, and binary code similarity analysis. Most existing methods rely on source code or on machine learning strategies for program similarity analysis, and that analysis is applied to one pair of programs at a time, so they are limited in both detection accuracy and the quantity of programs they can handle. Inspired by the recent success of neural networks and representation learning in various program analysis tasks, we propose NPFI, which analyzes the binary code of a program and identifies its functionality from the perspective of its assembly instruction sequence. To evaluate NPFI, we built a large dataset of 39,000 programs in six categories collected from Google Code Jam. Extensive experiments show that NPFI reaches an accuracy of 95.8% in binary code functionality recognition, which is much better than existing methods.
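The abstract does not describe how NPFI preprocesses instructions, so the following is only an illustrative sketch of what treating a program as an assembly instruction sequence could look like before it is fed to a neural model. Everything here is an assumption made for illustration, not the authors' pipeline: the function names (`tokenize_instruction`, `build_vocab`, `encode_program`), the `<imm>` placeholder for numeric literals, and the choice to reserve id 0 for unknown tokens.

```python
import re
from collections import Counter


def tokenize_instruction(line):
    """Split one assembly line into tokens (hypothetical normalization).

    Numeric literals are replaced by a generic <imm> token so that the
    vocabulary does not grow with every distinct constant or offset.
    """
    line = line.strip().lower()
    line = re.sub(r"\b0x[0-9a-f]+\b", "<imm>", line)  # hexadecimal literals
    line = re.sub(r"\b\d+\b", "<imm>", line)          # decimal literals
    return [tok for tok in re.split(r"[\s,]+", line) if tok]


def build_vocab(programs):
    """Map each token seen in the corpus to an integer id, starting at 1.

    Id 0 is reserved for tokens that were never seen (an assumption).
    """
    counts = Counter(
        tok
        for program in programs
        for line in program
        for tok in tokenize_instruction(line)
    )
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}


def encode_program(program, vocab):
    """Turn a program (list of assembly lines) into a flat list of token ids."""
    return [
        vocab.get(tok, 0)
        for line in program
        for tok in tokenize_instruction(line)
    ]


if __name__ == "__main__":
    program = ["push rbp", "mov rbp, rsp", "mov eax, 0x2A", "add eax, 1"]
    vocab = build_vocab([program])
    print(encode_program(program, vocab))
```

An id sequence of this kind is the usual input to an embedding layer followed by a sequence classifier. Which embedding and classifier NPFI actually uses is not stated in the abstract.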

Cited By

  • (2023) Semi-supervised Learning for Source Code Function Classification Using Hierarchical Density-Based Clustering. 2023 7th International Conference on System Reliability and Safety (ICSRS), pp. 513–517. DOI: 10.1109/ICSRS59833.2023.10381331. Online publication date: 22-Nov-2023

Published In

AIPR '21: Proceedings of the 2021 4th International Conference on Artificial Intelligence and Pattern Recognition
September 2021
715 pages
ISBN: 9781450384087
DOI: 10.1145/3488933

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

1. Binary code
2. Neural network
3. Program functionality identification

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

AIPR 2021
