
Anti-obfuscation Binary Code Clone Detection Based on Software Gene

  • Conference paper
Data Science (ICPCSEE 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1451)

Abstract

Information technology greatly facilitates people’s lives, but it also brings many security issues, such as code plagiarism, software infringement, and malicious code. To address these problems, reverse engineering is used to analyze large amounts of binary code manually, which is time-consuming. Moreover, because obfuscation techniques have matured, the disassembly generated from the same function differs greatly in its opcodes and control flow graph under different obfuscation options. This paper proposes a method, inspired by natural language processing, for semantic similarity matching of binary code at basic-block granularity and function granularity. On the task of matching binary code produced with different LLVM obfuscation options, the indicator reaches 99%, which is better than existing techniques.
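As a rough illustration of the general idea of treating disassembled instructions like words in a sentence, the following sketch compares basic blocks and functions by the cosine similarity of their token counts. It is not the method of this paper: the normalization scheme, the token names (`mem`, `imm`, `sym`), and all function names are assumptions made purely for illustration, and a real system would typically use learned embeddings rather than raw counts.

```python
import math
import re
from collections import Counter


def normalize(instr):
    """Map one disassembled instruction to a coarse token (assumed scheme)."""
    parts = instr.strip().lower().split(None, 1)
    opcode = parts[0]
    if len(parts) == 1:
        return opcode
    kinds = []
    for operand in parts[1].split(","):
        operand = operand.strip()
        if "[" in operand:
            kinds.append("mem")   # memory reference
        elif re.fullmatch(r"-?(0x[0-9a-f]+|\d+)", operand):
            kinds.append("imm")   # immediate constant
        else:
            kinds.append("sym")   # register or symbolic name (labels land here too)
    return opcode + "_" + "_".join(kinds)


def block_vector(block):
    """Count normalized tokens in one basic block (a list of instruction strings)."""
    return Counter(normalize(instr) for instr in block)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def function_vector(blocks):
    """Sum the block vectors of a function (a list of basic blocks)."""
    total = Counter()
    for block in blocks:
        total.update(block_vector(block))
    return total


def function_similarity(f, g):
    """Compare two functions by the cosine of their summed block vectors."""
    return cosine(function_vector(f), function_vector(g))


# Two blocks that differ only in register choice and constant value.
block_a = ["mov eax, 1", "add eax, ebx", "ret"]
block_b = ["mov ecx, 7", "add ecx, edx", "ret"]
# A block with no tokens in common with block_a.
block_c = ["xor eax, eax", "push ebp"]

sim_ab = cosine(block_vector(block_a), block_vector(block_b))
sim_ac = cosine(block_vector(block_a), block_vector(block_c))
```

Count-based vectors like these tolerate register renaming and changed constants, but they would not by themselves be robust to the control-flow and instruction-substitution obfuscations the abstract describes; that gap is what a semantic, learned representation is meant to close.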

Supported by the National Natural Science Foundation of China (No. 61802435).




Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Tang, K., Liu, F., Shan, Z., Zhang, C. (2021). Anti-obfuscation Binary Code Clone Detection Based on Software Gene. In: Zeng, J., Qin, P., Jing, W., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2021. Communications in Computer and Information Science, vol 1451. Springer, Singapore. https://doi.org/10.1007/978-981-16-5940-9_15


  • DOI: https://doi.org/10.1007/978-981-16-5940-9_15


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-5939-3

  • Online ISBN: 978-981-16-5940-9

  • eBook Packages: Computer Science, Computer Science (R0)
