Abstract
Growing numbers of advanced malware-based attacks against governments and corporations, for political, financial and scientific gains, have taken security breaches to the next level. In response to such attacks, both academia and industry have investigated techniques to model and reconstruct these attacks and to defend against them. While such efforts have been all useful in mitigating the effects of modern attacks, automated malware code reuse inspection and campaign attribution have received less attention.
In this paper, we present an automated system, called SCRUTINIZER, to identify code reuse in malware via a novel machine learning-based encoding mechanism at the function-level. By creating a large knowledge base of previously observed and tagged malware campaigns, we can compare unknown samples against this knowledge base and determine how much overlap exists. SCRUTINIZER leverages an unsupervised learning approach to filter out irrelevant functions before code reuse detection. It provides two valuable capabilities. First, it identifies ties between an unknown sample and those malware specimens that are known to be used by a specific campaign. Second, it inspects if specific tools or functionalities are used by a campaign. Using SCRUTINIZER, we were able to identify 12 samples that were previously unknown to us and that we were able to correctly assign to well-known APT campaigns.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
We plan to release a labeled dataset of malware binaries that have been used by different APT campaigns that we have access to.
- 2.
Version 9.1.2 with SHA-256: ebe3fa...ecac61.
- 3.
Version 10.0.0: https://releases.llvm.org/10.0.0/tools/clang.
- 4.
MD5: fcd7227891271a65b729a27de962c0cb.
- 5.
MD5: 276c28759d06e09a28524fffc2812580.
References
Abuhamad, M., AbuHmed, T., Mohaisen, A., Nyang, D.: Large-scale and language-oblivious code authorship identification. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 101–114 (2018)
Afroz, S., Islam, A.C., Stolerman, A., Greenstadt, R., McCoy, D.: Doppelgänger finder: taking stylometry to the underground. In: 2014 IEEE Symposium on Security and Privacy, pp. 212–226. IEEE (2014)
APT trends report Q1 2020 (2020). https://securelist.com/apt-trends-report-q1-2020/96826/. Accessed 05 July 2020
Baker, B.S.: On finding duplication and near-duplication in large software systems. In: Proceedings of 2nd Working Conference on Reverse Engineering, pp. 86–95. IEEE (1995)
Baxter, I.D., Pidgeon, C., Mehlich, M.: DMS/SPL REG: program transformations for practical scalable software evolution. In: Proceedings of 26th International Conference on Software Engineering, pp. 625–634. IEEE (2004)
Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: NDSS, vol. 9, pp. 8–11. Citeseer (2009)
Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
Bindiff: a comparison tool for binary files. https://www.zynamics.com/bindiff.html (2020). Accessed 05 May 2020
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a “siamese” time delay neural network. In: Advances in Neural Information Processing Systems, pp. 737–744 (1994)
Brumley, D., Poosankam, P., Song, D., Zheng, J.: Automatic patch-based exploit generation is possible: Techniques and implications. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 143–157. IEEE (2008)
Caliskan, A., et al.: When coding style survives compilation: de-anonymizing programmers from executable binaries. arXiv preprint arXiv:1512.08546 (2015)
Caliskan-Islam, A., et al.: De-anonymizing programmers via code stylometry. In: 24th USENIX Security Symposium (USENIX Security 2015), pp. 255–270 (2015)
Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14
APT Groups Target Healthcare and Essential Services. https://us-cert.cisa.gov/ncas/alerts/AA20126A (2020). Accessed 05 May 2020
Dask: A flexible library for parallel computing in python. https://docs.dask.org (2018). Accessed 05 May 2020
Dauber, E., et al.: Git blame who?: stylistic authorship attribution of small, incomplete source code fragments. Proc. Privacy Enhanc. Technol. 2019(3), 389–408 (2019)
DLL Files. https://www.dll-files.com (2020). Accessed 14 Mar 2020
Ducau, F.N., Rudd, E.M., Heppner, T.M., Long, A., Berlin, K.: SMART: semantic malware attribute relevance tagging. CoRR abs/1905.06262 (2019). http://arxiv.org/abs/1905.06262
Eschweiler, S., Yakdan, K., Gerhards-Padilla, E.: discovre: efficient cross-architecture identification of bugs in binary code. In: NDSS (2016)
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)
Feng, C., Li, T., Chana, D.: Multi-level anomaly detection in industrial control systems via package signatures and LSTM networks. In: 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 261–272. IEEE (2017)
Advanced Persistent Threat Groups. https://www.fireeye.com/current-threats/apt-groups.html (2020). Accessed 14 Mar 2020
Gao, D., Reiter, M.K., Song, D.: BinHunt: automatically finding semantic differences in binary programs. In: Chen, L., Ryan, M.D., Wang, G. (eds.) ICICS 2008. LNCS, vol. 5308, pp. 238–255. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88625-9_16
Ghidra: A software reverse engineering (SRE) suite of tools developed by NSA’s Research Directorate. https://ghidra-sre.org (2020). Accessed 14 Mar 2020
Graziano, M., et al.: Needles in a haystack: mining information from public dynamic analysis sandboxes for malware intelligence. In: 24th USENIX Security Symposium (USENIX Security 2015), pp. 1057–1072 (2015)
Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K., Funaya, K.: Robust online time series prediction with recurrent neural networks. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 816–825. IEEE (2016)
Han, X., Pasquier, T., Bates, A., Mickens, J., Seltzer, M.: UNICORN: runtime provenance-based detector for advanced persistent threats. In: NDSS (2020)
Hardy, S., et al.: Targeted threat index: Characterizing and quantifying politically-motivated targeted malware. In: 23rd USENIX Security Symposium (USENIX Security 2014), pp. 527–541 (2014)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Hossain, M.N., et al.: SLEUTH: real-time attack scenario reconstruction from COTS audit data. In: 26th USENIX Security Symposium (USENIX Security 2017), pp. 487–504 (2017)
Hu, X., Shin, K.G.: Duet: integration of dynamic and static analyses for malware clustering with cluster ensembles. In: Proceedings of the 29th Annual Computer Security Applications Conference, pp. 79–88 (2013)
Hu, X., Shin, K.G., Bhatkar, S., Griffin, K.: Mutantx-s: scalable malware clustering based on static features. In: 2013 USENIX Annual Technical Conference (USENIX ATC 2013), pp. 187–198 (2013)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)
Jang, J., Woo, M., Brumley, D.: Towards automatic software lineage inference. In: 22nd USENIX Security Symposium (USENIX Security 2013), pp. 81–96 (2013)
Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE 2007), pp. 96–105. IEEE (2007)
Targeted Cyberattacks Logbook. https://apt.securelist.com/#!/threats/ (2018). Accessed 14 Mar 2020
Komondoor, R., Horwitz, S.: Using slicing to identify duplication in source code. In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 40–56. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-47764-0_3
Kornblum, J.: Identifying almost identical files using context triggered piecewise hashing. Digit. Invest. 3, 91–97 (2006)
Lastline. https://www.lastline.com (2021). Accessed 04 May 2021
Le Blond, S., Uritesc, A., Gilbert, C., Chua, Z.L., Saxena, P., Kirda, E.: A look at targeted attacks through the lense of an NGO. In: 23rd USENIX Security Symposium (USENIX Security 2014), pp. 543–558 (2014)
Li, Y., et al.: Experimental study of fuzzy hashing in malware clustering analysis. In: 8th Workshop on Cyber Security Experimentation and Test (CSET 2015) (2015)
Li, Z., Lu, S., Myagmar, S., Zhou, Y.: CP-miner: a tool for finding copy-paste and related bugs in operating system code. OSdi 4, 289–302 (2004)
Luo, L., Ming, J., Wu, D., Liu, P., Zhu, S.: Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 389–400 (2014)
Masci, J., Bronstein, M.M., Bronstein, A.M., Schmidhuber, J.: Multimodal similarity-preserving hashing. IEEE Trans. Pattern Anal. Mach. Intell. 36(4), 824–830 (2013)
McCabe, T.J.: A complexity measure. IEEE Trans. Softw. Eng. 4, 308–320 (1976)
McInnes, L., Healy, J.: Accelerated hierarchical density based clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33–42. IEEE (2017)
Meng, X., Miller, B.P., Jun, K.-S.: Identifying multiple authors in a binary program. In: Foley, S.N., Gollmann, D., Snekkenes, E. (eds.) ESORICS 2017. LNCS, vol. 10493, pp. 286–304. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66399-9_16
Milajerdi, S.M., Gjomemo, R., Eshete, B., Sekar, R., Venkatakrishnan, V.: Holmes: real-time apt detection through correlation of suspicious information flows. In: 2019 IEEE Symposium on Security and Privacy (SP), pp. 1137–1152. IEEE (2019)
Mimikatz: an open-source application for veiwing and saving authentication credentials (2014). https://github.com/gentilkiwi/mimikatz. Accessed 05 May 2020
MITRE ATT&CK: a globally-accessible knowledge base of adversary tactics and techniques based on real-world observations (2020). https://attack.mitre.org/. Accessed 14 Mar 2020
Moonlight - Targeted attacks in the Middle East (2016). https://tinyurl.com/45m3jtx8. Accessed 05 July 2020
Mueller, J., Thyagarajan, A.: Siamese recurrent architectures for learning sentence similarity. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
Meet the threat actors: List of APTs and adversary groups (2019). https://www.crowdstrike.com/blog/meet-the-adversaries/. Accessed 05 May 2020
Vietnamese Threat Actors APT32 Targeting Wuhan Government and Chinese Ministry of Emergency Management in Latest Example of COVID-19 Related Espionage (2020). https://tinyurl.com/7whx7ecr. Accessed 05 May 2020
Cyber espionage is alive and well: Apt32 and the threat to global corporations (2017). https://tinyurl.com/54eact6v. Accessed 05 May 2020
Oh Song, H., Jegelka, S., Rathod, V., Murphy, K.: Deep metric learning via facility location. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390 (2017)
Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012 (2016)
Oliver, J., Cheng, C., Chen, Y.: TLSH-a locality sensitive hash. In: 2013 Fourth Cybercrime and Trustworthy Computing Workshop, pp. 7–13. IEEE (2013)
Benchmarking performance and scaling of python clustering algorithms (2020). https://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html. Accessed 05 May 2020
Pewny, J., Garmany, B., Gawlik, R., Rossow, C., Holz, T.: Cross-architecture bug search in binary executables. In: 2015 IEEE Symposium on Security and Privacy, pp. 709–724. IEEE (2015)
Rosenblum, N., Zhu, X., Miller, B.P.: Who wrote this code? Identifying the authors of program binaries. In: Atluri, V., Diaz, C. (eds.) ESORICS 2011. LNCS, vol. 6879, pp. 172–189. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23822-2_10
Sæbjørnsen, A., Willcock, J., Panas, T., Quinlan, D., Su, Z.: Detecting code clones in binary executables. In: Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pp. 117–128 (2009)
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76–85 (2003)
Scrutinizer: Detecting code reuse in malware via decompilation and machine learning (2021). https://github.com/OMirzaei/SCRUTINIZER. Accessed 04 May 2021
THREAT GROUP CARDS: A threat actor encyclopedia (2019). https://tinyurl.com/bb8mt23k. Accessed 05 Oct 2019
ThreatMiner: Data Mining for Threat Intelligence (2020). https://www.threatminer.org/index.php. Accessed 14 Mar 2020
Upchurch, J., Zhou, X.: Variant: a malware similarity testing framework. In: 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pp. 31–39. IEEE (2015)
Verizon’s 2020 data breach investigations report (2020). https://tinyurl.com/56m7m9ym. Accessed 05 May 2020
VirusTotal (2020). https://www.virustotal.com/gui/home/search. Accessed 05 June 2020
Waterbug: Espionage Group Rolls Out Brand-New Toolset in Attacks Against Governments (2020). https://tinyurl.com/92s76xdn. Accessed 05 May 2020
Xu, X., Liu, C., Feng, Q., Yin, H., Song, L., Song, D.: Neural network-based graph embedding for cross-platform binary code similarity detection. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 363–376 (2017)
Acknowledgement
This work was partially-supported by National Science Foundation (NSF) under grant CNS-1703454, and the Office of Naval Research (ONR) under the “In Situ Malware” project. This work was also partially-supported by Secure Business Austria.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mirzaei, O., Vasilenko, R., Kirda, E., Lu, L., Kharraz, A. (2021). SCRUTINIZER: Detecting Code Reuse in Malware via Decompilation and Machine Learning. In: Bilge, L., Cavallaro, L., Pellegrino, G., Neves, N. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2021. Lecture Notes in Computer Science(), vol 12756. Springer, Cham. https://doi.org/10.1007/978-3-030-80825-9_7
Download citation
DOI: https://doi.org/10.1007/978-3-030-80825-9_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80824-2
Online ISBN: 978-3-030-80825-9
eBook Packages: Computer ScienceComputer Science (R0)