SCRUTINIZER: Detecting Code Reuse in Malware via Decompilation and Machine Learning

Mirzaei, Omid; Vasilenko, Roman; Kirda, Engin; Lu, Long; Kharraz, Amin

doi:10.1007/978-3-030-80825-9_7


Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12756)

Included in the following conference series:

International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment


Abstract

Growing numbers of advanced malware-based attacks against governments and corporations, for political, financial and scientific gains, have taken security breaches to the next level. In response to such attacks, both academia and industry have investigated techniques to model and reconstruct these attacks and to defend against them. While such efforts have all been useful in mitigating the effects of modern attacks, automated inspection of malware code reuse and campaign attribution have received less attention.

In this paper, we present an automated system, called SCRUTINIZER, that identifies code reuse in malware via a novel machine learning-based encoding mechanism at the function level. By creating a large knowledge base of previously observed and tagged malware campaigns, we can compare unknown samples against this knowledge base and determine how much overlap exists. SCRUTINIZER leverages an unsupervised learning approach to filter out irrelevant functions before code reuse detection. It provides two valuable capabilities. First, it identifies ties between an unknown sample and those malware specimens that are known to be used by a specific campaign. Second, it determines whether specific tools or functionalities are used by a campaign. Using SCRUTINIZER, we identified 12 samples that were previously unknown to us and correctly assigned them to well-known APT campaigns.
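
The comparison of an unknown sample against a tagged knowledge base can be illustrated with a minimal sketch. This is not the SCRUTINIZER implementation: the abstract does not specify the encoding or the similarity measure, so the sketch assumes that each function has already been encoded as a fixed-length vector and that cosine similarity with a hypothetical threshold of 0.9 decides a match. The names `campaign_overlap`, `knowledge_base`, and the toy vectors are illustrative only, and the unsupervised filtering step is not modeled.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def campaign_overlap(sample_funcs, knowledge_base, threshold=0.9):
    """Return, per campaign, the fraction of the sample's functions
    that match at least one known function of that campaign.

    sample_funcs   -- list of function vectors from the unknown sample
    knowledge_base -- list of (campaign_label, function_vector) pairs
    threshold      -- assumed similarity cut-off (hypothetical value)
    """
    if not sample_funcs:
        return {}
    hits = defaultdict(int)
    for vec in sample_funcs:
        # Collect each campaign at most once per sample function.
        matched = {
            campaign
            for campaign, kb_vec in knowledge_base
            if cosine(vec, kb_vec) >= threshold
        }
        for campaign in matched:
            hits[campaign] += 1
    return {c: n / len(sample_funcs) for c, n in hits.items()}


# Toy data: three known functions from two hypothetical campaigns.
knowledge_base = [
    ("campaign_A", [1.0, 0.0, 0.0]),
    ("campaign_A", [0.0, 1.0, 0.0]),
    ("campaign_B", [0.0, 0.0, 1.0]),
]
unknown_sample = [
    [0.99, 0.05, 0.0],  # close to a campaign_A function
    [0.0, 0.0, 1.0],    # identical to the campaign_B function
    [0.5, 0.5, 0.5],    # generic function, matches nothing
    [0.0, 1.0, 0.1],    # close to the other campaign_A function
]
overlap = campaign_overlap(unknown_sample, knowledge_base)
print(overlap)
```

In this toy setting, half of the sample's functions match campaign_A and a quarter match campaign_B. A linear scan like this would not scale to a large knowledge base, where an approximate nearest-neighbor index would be the more usual choice.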


Notes

1. We plan to release a labeled dataset of malware binaries that have been used by different APT campaigns that we have access to.
2. Version 9.1.2 with SHA-256: ebe3fa...ecac61.
3. Version 10.0.0: https://releases.llvm.org/10.0.0/tools/clang.
4. MD5: fcd7227891271a65b729a27de962c0cb.
5. MD5: 276c28759d06e09a28524fffc2812580.


Acknowledgement

This work was partially supported by the National Science Foundation (NSF) under grant CNS-1703454, and by the Office of Naval Research (ONR) under the “In Situ Malware” project. This work was also partially supported by Secure Business Austria.

Author information

Authors and Affiliations

Northeastern University, Boston, USA
Omid Mirzaei, Engin Kirda & Long Lu
VMware, Boston, USA
Roman Vasilenko
Florida International University, Miami, USA
Amin Kharraz


Corresponding author

Correspondence to Omid Mirzaei.

Editor information

Editors and Affiliations

NortonLifeLock Research Group, Biot, France
Leyla Bilge
King's College London, London, UK
Lorenzo Cavallaro
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Giancarlo Pellegrino
University of Lisbon, Lisbon, Portugal
Nuno Neves

About this paper

Cite this paper

Mirzaei, O., Vasilenko, R., Kirda, E., Lu, L., Kharraz, A. (2021). SCRUTINIZER: Detecting Code Reuse in Malware via Decompilation and Machine Learning. In: Bilge, L., Cavallaro, L., Pellegrino, G., Neves, N. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2021. Lecture Notes in Computer Science, vol 12756. Springer, Cham. https://doi.org/10.1007/978-3-030-80825-9_7

DOI: https://doi.org/10.1007/978-3-030-80825-9_7
Published: 09 July 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80824-2
Online ISBN: 978-3-030-80825-9
eBook Packages: Computer Science, Computer Science (R0)
