Abstract
Information technology greatly facilitates people’s lives, but it also brings many security issues, such as code plagiarism, software infringement, and malicious code. To address these problems, reverse engineering is applied to analyze large amounts of binary code manually, which costs a lot of time. Moreover, owing to the maturity of obfuscation techniques, the disassembly generated from the same function differs greatly in its opcodes and control flow graph under different obfuscation options. This paper proposes a method, inspired by natural language processing, that performs semantic similarity matching of binary code at basic block granularity and function granularity. In the similarity matching task on binary code obtained with different obfuscation options of LLVM, the indicator reaches 99%, which is better than existing technologies.
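The idea of treating disassembly like natural language can be illustrated with a minimal sketch. It is not the method of the paper: it assumes x86-style assembly text, uses a hypothetical normalization scheme (`REG`, `IMM`, `MEM`, `SYM` placeholders), and substitutes plain token counts with cosine similarity for the learned embeddings that an NLP-inspired approach would normally use. All function names here are illustrative.

```python
import math
import re
from collections import Counter

# Assumed operand classes for illustration only; a real tool would rely on a
# disassembler's operand types rather than regular expressions.
_REG = re.compile(r"^(?:[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r\d+[dwb]?|[abcd][lh])$")
_IMM = re.compile(r"^(?:0x[0-9a-f]+|-?\d+)$")


def normalize(instruction):
    """Turn one instruction into tokens: the mnemonic plus coarse operand classes."""
    text = instruction.strip().lower()
    mnemonic, _, rest = text.partition(" ")
    tokens = [mnemonic]
    for operand in (part.strip() for part in rest.split(",")):
        if not operand:
            continue
        if "[" in operand:
            tokens.append("MEM")
        elif _REG.match(operand):
            tokens.append("REG")
        elif _IMM.match(operand):
            tokens.append("IMM")
        else:
            tokens.append("SYM")
    return tokens


def block_vector(block):
    """Bag-of-tokens vector for a basic block (a list of instruction strings)."""
    counts = Counter()
    for instruction in block:
        counts.update(normalize(instruction))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# The same increment, once as written and once after instruction substitution.
plain = block_vector(["mov eax, [rbp-8]", "add eax, 1", "mov [rbp-8], eax"])
substituted = block_vector(["mov eax, [rbp-8]", "sub eax, -1", "mov [rbp-8], eax"])
unrelated = block_vector(["push rbp", "call printf", "ret"])

print(cosine(plain, substituted) > cosine(plain, unrelated))
```

Count vectors only capture surface overlap, so a rewritten opcode still lowers the score here. Learned embeddings are intended to place semantically equivalent instructions close together, which is what obfuscation-resilient matching requires.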
Supported by the National Natural Science Foundation of China (No. 61802435).
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tang, K., Liu, F., Shan, Z., Zhang, C. (2021). Anti-obfuscation Binary Code Clone Detection Based on Software Gene. In: Zeng, J., Qin, P., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2021. Communications in Computer and Information Science, vol 1451. Springer, Singapore. https://doi.org/10.1007/978-981-16-5940-9_15
DOI: https://doi.org/10.1007/978-981-16-5940-9_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5939-3
Online ISBN: 978-981-16-5940-9
eBook Packages: Computer Science, Computer Science (R0)