SCRUTINIZER: Detecting Code Reuse in Malware via Decompilation and Machine Learning

  • Conference paper
Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12756)

Abstract

Growing numbers of advanced malware-based attacks against governments and corporations, for political, financial, and scientific gain, have taken security breaches to the next level. In response, both academia and industry have investigated techniques to model and reconstruct these attacks and to defend against them. While these efforts have all been useful in mitigating the effects of modern attacks, automated inspection of malware code reuse and campaign attribution have received less attention.

In this paper, we present an automated system, called SCRUTINIZER, that identifies code reuse in malware via a novel machine learning-based encoding mechanism at the function level. By creating a large knowledge base of previously observed and tagged malware campaigns, we can compare unknown samples against this knowledge base and determine how much overlap exists. SCRUTINIZER leverages an unsupervised learning approach to filter out irrelevant functions before code reuse detection. It provides two valuable capabilities. First, it identifies ties between an unknown sample and malware specimens known to be used by a specific campaign. Second, it checks whether specific tools or functionalities are used by a campaign. Using SCRUTINIZER, we identified 12 samples that were previously unknown to us and correctly assigned them to well-known APT campaigns.
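The comparison of an unknown sample against a tagged knowledge base can be illustrated with a minimal sketch. It is not the encoding or matching procedure used by SCRUTINIZER, which this abstract does not specify. It assumes each function has already been encoded as a numeric vector, and it uses cosine similarity with a hypothetical threshold of 0.9 to decide whether two functions match. The names `cosine`, `campaign_overlap`, and the campaign labels are illustrative only, and the unsupervised filtering step is omitted.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def campaign_overlap(sample_funcs, knowledge_base, threshold=0.9):
    """For each campaign, return the fraction of the sample's functions that
    have at least one sufficiently similar function in that campaign.

    sample_funcs:   list of function vectors from the unknown sample
    knowledge_base: dict mapping a campaign label to a list of function vectors
    threshold:      assumed similarity cut-off (hypothetical value)
    """
    overlap = {}
    for campaign, known_funcs in knowledge_base.items():
        matched = sum(
            1
            for f in sample_funcs
            if any(cosine(f, k) >= threshold for k in known_funcs)
        )
        overlap[campaign] = matched / len(sample_funcs) if sample_funcs else 0.0
    return overlap


# Toy data: two-dimensional vectors stand in for real function encodings.
knowledge_base = {
    "campaign_a": [[1.0, 0.0], [0.0, 1.0]],
    "campaign_b": [[-1.0, 0.0]],
}
unknown_sample = [[1.0, 0.05], [0.7, 0.7]]

print(campaign_overlap(unknown_sample, knowledge_base))
```

In this toy run, one of the two sample functions is close to a function tagged with `campaign_a` and none is close to `campaign_b`, so the reported overlap is higher for `campaign_a`. A real system would need a learned encoding and a tuned threshold.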

Notes

  1. We plan to release a labeled dataset of malware binaries that have been used by different APT campaigns that we have access to.

  2. Version 9.1.2 with SHA-256: ebe3fa...ecac61.

  3. Version 10.0.0: https://releases.llvm.org/10.0.0/tools/clang.

  4. MD5: fcd7227891271a65b729a27de962c0cb.

  5. MD5: 276c28759d06e09a28524fffc2812580.

Acknowledgement

This work was partially supported by the National Science Foundation (NSF) under grant CNS-1703454, and by the Office of Naval Research (ONR) under the “In Situ Malware” project. This work was also partially supported by Secure Business Austria.

Author information

Corresponding author

Correspondence to Omid Mirzaei.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Mirzaei, O., Vasilenko, R., Kirda, E., Lu, L., Kharraz, A. (2021). SCRUTINIZER: Detecting Code Reuse in Malware via Decompilation and Machine Learning. In: Bilge, L., Cavallaro, L., Pellegrino, G., Neves, N. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2021. Lecture Notes in Computer Science, vol 12756. Springer, Cham. https://doi.org/10.1007/978-3-030-80825-9_7

  • DOI: https://doi.org/10.1007/978-3-030-80825-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-80824-2

  • Online ISBN: 978-3-030-80825-9

  • eBook Packages: Computer Science, Computer Science (R0)
