ABSTRACT
Software in the wild is usually released as stripped binaries that contain no debug information (e.g., function names). This paper studies the problem of reassigning descriptive names to functions in such binaries to facilitate reverse engineering. Because this problem is essentially a data-driven prediction task, convincing research should be based on sufficiently large and diverse data. However, prior studies have been limited to small-scale datasets because their techniques rely on heavyweight binary analysis, which does not scale to large binaries or to large collections of binaries.
This paper presents the Neural Function Rename Engine (NFRE), a lightweight framework for function name reassignment that uses both the sequential and the structural information of assembly code. NFRE models assembly code with fine-grained, easily acquired features, which makes it more effective and efficient than existing techniques. In addition, we construct a large-scale dataset and present two data-preprocessing approaches that improve its usability. Thanks to its lightweight design, NFRE can be trained efficiently on the large-scale dataset and therefore generalizes better to unknown functions. Comparative experiments show that NFRE outperforms two existing techniques by relative improvements of 32% and 16%, respectively, while spending much less time on binary analysis.
Index Terms
- A lightweight framework for function name reassignment based on large-scale stripped binaries