Functionality Recognition on Binary Code with Neural Representation Learning

Authors:

Yaqian Huang

AIPR '21: Proceedings of the 2021 4th International Conference on Artificial Intelligence and Pattern Recognition

Pages 280 - 286

https://doi.org/10.1145/3488933.3489033

Published: 25 February 2022

Abstract

Recognizing the functionality of binary code is valuable in malware analysis, software forensics, binary code similarity analysis, and other applications. Most existing methods rely on source code or on machine learning strategies to analyze program similarity, and they compare only one pair of programs at a time, which limits both detection accuracy and the number of programs that can be analyzed. Inspired by the recent success of neural networks and representation learning in various program analysis tasks, we propose NPFI, which analyzes a program's binary code and identifies its functionality from the perspective of its assembly instruction sequence. To evaluate NPFI, we built a large dataset of 39,000 programs in six categories collected from Google Code Jam. Extensive experiments show that NPFI reaches an accuracy of 95.8% in binary code functionality recognition, which is much better than existing methods.
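To make the idea of classifying a program by its assembly instruction sequence concrete, here is a minimal sketch. It is not NPFI: the abstract does not describe the model's architecture, so a bag-of-tokens vector with a nearest-centroid rule stands in for the learned neural representation. The operand abstraction (`IMM`, `MEM`), the function names, and the toy labels `sort` and `math` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Matches a decimal or hexadecimal immediate such as 16, -8 or 0x40.
HEX_OR_INT = re.compile(r"^-?(0x[0-9a-f]+|\d+)$")


def normalize_instruction(instr: str) -> str:
    """Turn one assembly instruction into a single token.

    Immediates become IMM and memory operands become MEM so that
    concrete constants and addresses do not inflate the vocabulary.
    (Assumed preprocessing, not taken from the paper.)
    """
    parts = instr.strip().split(None, 1)
    opcode = parts[0].lower()
    if len(parts) == 1:
        return opcode
    operands = []
    for op in parts[1].split(","):
        op = op.strip().lower()
        if HEX_OR_INT.match(op):
            operands.append("IMM")
        elif "[" in op:
            operands.append("MEM")
        else:
            operands.append(op)
    return opcode + "_" + "_".join(operands)


def embed(instructions):
    """Bag-of-tokens vector; a stand-in for a learned embedding."""
    return Counter(normalize_instruction(i) for i in instructions)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_programs):
    """Sum the vectors of all programs that share a label."""
    centroids = {}
    for label, instructions in labeled_programs:
        centroids.setdefault(label, Counter()).update(embed(instructions))
    return centroids


def predict(centroids, instructions):
    """Return the label whose centroid is most similar to the program."""
    vec = embed(instructions)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy training data; real inputs would come from a disassembler.
train = [
    ("sort", ["mov eax, [rbp-8]", "cmp eax, ebx", "jle 0x40", "xchg eax, ebx"]),
    ("math", ["imul eax, ebx", "add eax, 1", "idiv ecx"]),
]
centroids = train_centroids(train)
label = predict(centroids, ["cmp eax, ebx", "jle 0x80", "mov eax, [rbp-16]"])
```

The point of the sketch is the framing: each program is classified into a functionality category directly, rather than being compared against another program as in pairwise similarity analysis.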


Cited By

Sun Z, Han Y, He D, Wang B, Zhang L, Mao D (2023). Semi-supervised Learning for Source Code Function Classification Using Hierarchical Density-Based Clustering. 2023 7th International Conference on System Reliability and Safety (ICSRS), pp. 513-517. Online publication date: 22-Nov-2023
https://doi.org/10.1109/ICSRS59833.2023.10381331

Index Terms

Functionality Recognition on Binary Code with Neural Representation Learning
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
    2. Machine learning approaches
      1. Neural networks
2. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation

Index terms have been assigned to the content through auto-classification.

Information

Published In

AIPR '21: Proceedings of the 2021 4th International Conference on Artificial Intelligence and Pattern Recognition

September 2021

715 pages

ISBN:9781450384087

DOI:10.1145/3488933

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Key Research and Development Program of Shaanxi
Natural Science Basic Research Program of Shaanxi Province
International Science and Technology Cooperation Program of Shaanxi
National Natural Science Foundation of China - State Grid Corporation Joint Fund for Smart Grid
Science and Technology of Xi'an
the Special Funds for Construction of Key Disciplines in Universities in Shaanxi

Conference

AIPR 2021

AIPR 2021: 2021 4th International Conference on Artificial Intelligence and Pattern Recognition

September 24 - 26, 2021

Xiamen, China
