Abstract
To date, most analysis of collaboration between malware authors has been performed on metadata and compiled binaries, while ignoring artifacts present in the source code. We collect a vast amount of malicious source code from Underground Forum posts, Underground Forum code attachments, and GitHub repositories, and devise a methodology that filters out most auxiliary code so that the measurement focuses on malicious code. We leverage this to perform an in-depth measurement of the reuse of malicious code between these malware centers as well as StackOverflow. We find that our methodology has high precision in identifying malicious code (93.1%) and provides a contemporary snapshot of malware code reuse across the Web, offering insights into how this reuse takes place.
References
Baltes, S., Diehl, S.: Usage and attribution of stack overflow code snippets in github projects. Empir. Softw. Eng. 24(3), 1259–1295 (2019)
Calleja, A., Tapiador, J., Caballero, J.: The malsource dataset: quantifying complexity and code reuse in malware development. IEEE Trans. Inf. Forensics Secur. 14(12), 3175–3190 (2018)
Cheng, X., Jiang, L., Zhong, H., Yu, H., Zhao, J.: On the feasibility of detecting cross-platform code clones via identifier similarity. In: Proceedings of the 5th International Workshop on Software Mining, KDD, pp. 39–42 (2016)
Islam, R., Rokon, M.O.F., Darki, A., Faloutsos, M.: Hackerscope: the dynamics of a massive hacker online ecosystem. arXiv preprint arXiv:2011.07222 (2020)
Moradi-Jamei, B., Kramer, B.L., Calderón, J.B.S., Korkmaz, G.: Community formation and detection on github collaboration networks. In: Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, KDD, pp. 244–251 (2021)
Nafi, K.W., Kar, T.S., Roy, B., Roy, C.K., Schneider, K.A.: CLCDSA: cross language code clone detection using syntactical features and API documentation. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1026–1037. IEEE (2019)
Nakagawa, T., Higo, Y., Kusumoto, S.: Nil: large-scale detection of large-variance clones. In: Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 830–841 (2021)
Pastrana, S., Thomas, D.R., Hutchings, A., Clayton, R.: Crimebb: enabling cybercrime research on underground forums at scale. In: Proceedings of the 2018 World Wide Web Conference, pp. 1845–1854 (2018)
Qian, Y., Zhang, Y., Chawla, N., Ye, Y., Zhang, C.: Malicious repositories detection with adversarial heterogeneous graph contrastive learning. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 1645–1654 (2022)
Ragkhitwetsagul, C., Krinke, J., Clark, D.: A comparison of code similarity analysers. Empir. Softw. Eng. 23(4), 2464–2519 (2018)
Ragkhitwetsagul, C., Krinke, J., Paixao, M., Bianco, G., Oliveto, R.: Toxic code snippets on stack overflow. IEEE Trans. Software Eng. 47(3), 560–581 (2019)
Rokon, M.O.F., Islam, R., Darki, A., Papalexakis, E.E., Faloutsos, M.: Sourcefinder: finding malware source-code from publicly available repositories in github. In: 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), pp. 149–163 (2020)
Rokon, M.O.F., Yan, P., Islam, R., Faloutsos, M.: Repo2vec: a comprehensive embedding approach for determining repository similarity. In: 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 355–365. IEEE (2021)
Roy, C.K., Cordy, J.R.: A survey on software clone detection research. Queen’s Sch. Comput. TR 541(115), 64–68 (2007)
Saini, V., Farmahinifarahani, F., Lu, Y., Baldi, P., Lopes, C.V.: Oreo: detection of clones in the twilight zone. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 354–365 (2018)
Sajnani, H., Saini, V., Svajlenko, J., Roy, C.K., Lopes, C.V.: Sourcerercc: scaling code clone detection to big-code. In: Proceedings of the 38th International Conference on Software Engineering, pp. 1157–1168 (2016)
Svajlenko, J., Roy, C.K.: Cloneworks: a fast and flexible large-scale near-miss clone detection tool. In: ICSE (Companion Volume), pp. 177–179 (2017)
Thomas, K., et al.: Framing dependencies introduced by underground commoditization. In: Proceedings of the Workshop on the Economics of Information Security (WEIS) (2015)
Weaver, N., Paxson, V., Staniford, S., Cunningham, R.: Large scale malicious code: a research agenda (2003)
Yahya, M.A., Kim, D.K.: CLCD-I: cross-language clone detection by using deep learning with infercode. Computers 12(1), 12 (2023)
Yang, D., Martins, P., Saini, V., Lopes, C.: Stack overflow in github: any snippets there? In: 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pp. 280–290. IEEE (2017)
yoeo: Guesslang (2020). https://github.com/yoeo/guesslang
Acknowledgements
This project was funded by TED2021-132900A-I00, from the Spanish Ministry of Science and Innovation, with funds from MCIN/AEI /10.13039/501100011033, and the European Union-NextGenerationEU/PRTR; and by PID2022-143304OB-I00 funded by MCIN/AEI /10.13039/501100011033/ and the ERDF “A way of making Europe.” M. Tereszkowski-Kaminski’s work was supported by “Programa Investigo” grant 2022-C23.I01.P03.S0020-0000038, funded by the European Union NextGeneration-EU/PRTR and MITES/SEPE. G. Suarez-Tangil has been appointed as 2019 Ramon y Cajal fellow (RYC-2020-029401-I) funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future.
Appendices
A Benign Datasets
The benign datasets are used to find code reuse between them and our corpus samples of the same language during the Benign Function Filtering step of our methodology (refer to Sect. 4). We have 3 benign code datasets, one for each language category in the measurement. There is no separate benign dataset for C, as the C and C++ measurements are performed on the same samples.
50KC. This is a dataset of compilable Java projects from GitHub. 3,624 of these are included in our benign function filtering step.
Wild C++. This is a dataset of C++ function source code files gathered from GitHub repositories that contain C++ source code, queried for projects with at least 10 stars. 1,000,000 samples are included in our benign function filtering step. This is only a fraction of the available samples; we are limited by the computational resources that clone detection requires.
Py150k. This is a dataset of 150k Python source files gathered from GitHub repositories. All 150,000 samples are included in our benign function filtering step.
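The role of these datasets in Benign Function Filtering can be illustrated with a minimal sketch. It assumes that functions are compared as sets of tokens using Jaccard similarity and that a corpus function is discarded once it matches any benign function above a threshold. The tokenizer, the `threshold` value, and the function names are hypothetical choices for illustration only; the actual clone detector and its settings are described in Sect. 4, not here.

```python
import re


def tokenize(source: str) -> frozenset:
    """Split source text into a set of identifier and number tokens.

    Illustrative only: a real clone detector normalizes code far more carefully.
    """
    return frozenset(re.findall(r"[A-Za-z_]\w*|\d+", source))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_benign(corpus_functions, benign_functions, threshold=0.8):
    """Keep only corpus functions that do not resemble any benign function.

    `threshold` is an assumed value, not one taken from the paper.
    """
    benign_tokens = [tokenize(f) for f in benign_functions]
    kept = []
    for func in corpus_functions:
        tokens = tokenize(func)
        if not any(jaccard(tokens, b) >= threshold for b in benign_tokens):
            kept.append(func)
    return kept


if __name__ == "__main__":
    benign = ["def add(a, b):\n    return a + b"]
    corpus = [
        "def add(a, b): return a + b",
        "def encrypt_files(path, key): walk(path); xor(key)",
    ]
    print(filter_benign(corpus, benign))
```

The pairwise comparison in this sketch is quadratic in the number of functions, which hints at why only a subset of Wild C++ can be used in practice.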
B Prominent Measurement Clusters
B.1 C/C++ Clusters
We next describe the most prominent similarity matches in C/C++ code reuse:
- Cluster #1: There are 1,781 UFSF repositories, 1,161 UF post threads, and 764 UF code attachments in the main supercluster. Within this supercluster, the nature and amount of reuse varies substantially. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.
- Cluster #2: These are 12 hacks for a videogame called DayZ. All 12 of these projects are code attachments from Underground Forums. They reuse a specific piece of malicious code, a thread callback function.
- Cluster #3: These are 11 hacks for a videogame called Grand Theft Auto 5. 10 of these are code attachments from Underground Forums, and the remaining one is a GitHub repository. They reuse a DLL-loading snippet.
- Cluster #4: These are 9 Linux malware projects, 8 of them with "Rootkit" in the name, and all 9 are GitHub repositories. The name of one of them suggests it was a homework assignment for a class. They reuse a piece of code that initializes the rootkit.
- Cluster #5: These are 8 threads from an Underground Forum. Their code reuse centers around low-level memory manipulation, with function names such as ModifyMemory() and WriteToMemory().
- Cluster #6: These are 7 ransomware samples, all GitHub repositories, one of them being a collection repository for ransomware. The reused code consists of functions that encrypt and decrypt files, as well as functions that obtain directory listings of files on the host system.
- Cluster #7: These are 6 code attachments from Underground Forums, all 6 being hacks for the videogame Counter Strike: Global Offensive.
- Cluster #8: These are 6 GitHub repositories dedicated to “hacking Windows memory”.
- Cluster #9: These are 6 GitHub repositories which contain rootkits.
- Cluster #10: These are 5 code attachments from Underground Forums which are videogame hacks. They appear to be different versions of the same hack, but we are unable to ascertain which videogame they target.
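To make the cluster descriptions above concrete, the following minimal sketch shows one way such clusters and their per-source counts could be derived from pairwise reuse matches. It assumes that each project is a node, that each detected reuse match is an undirected edge, and that a cluster is a connected component of the resulting graph; this appendix does not state that the clusters were computed this way. The node identifiers and source labels are hypothetical.

```python
from collections import Counter, defaultdict


def connected_clusters(edges):
    """Group nodes into connected components, largest first."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from `start`.
        stack = [start]
        component = set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


def source_breakdown(cluster, source_of):
    """Count how many nodes of a cluster come from each source."""
    return Counter(source_of[node] for node in cluster)


if __name__ == "__main__":
    # Hypothetical reuse matches between projects.
    edges = [("gh1", "uf1"), ("uf1", "uf2"), ("gh2", "gh3")]
    source_of = {
        "gh1": "github", "gh2": "github", "gh3": "github",
        "uf1": "uf_attachment", "uf2": "uf_attachment",
    }
    for rank, cluster in enumerate(connected_clusters(edges), start=1):
        print(rank, sorted(cluster), dict(source_breakdown(cluster, source_of)))
```

Ranking components by size mirrors the ordering used in the lists of this appendix, where Cluster #1 is the largest.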
B.2 Java Clusters
We next describe the most prominent similarity matches seen in Java code reuse:
- Cluster #1: There are 6 nodes in the network that come from Underground Forum snippets, making up 3.1% of the network. All of them are within the largest cluster. They deal with socket connections and user login functionality. Underground Forums and SourceFinder repositories, on the other hand, make up 89.8% of the network. These samples vary greatly in the malicious functionality they reuse, with some reusing socket connectivity code and others network vulnerability scanning code. Yet another sample obtained code from StackOverflow that fakes the working of threads using Thread.sleep() calls.
- Cluster #2: The second largest cluster consists of 19 cryptocurrency miner GitHub repositories. The reuse centers around various functionality, from blockchain protocol implementations to encryption implementations.
- Cluster #3: The third largest cluster consists of 7 hacks for the video game Call of Duty: Modern Warfare 3, all existing in the network as source code attachments from Underground Forums.
- Cluster #4-5: The fourth and fifth largest clusters contain 4 projects each. One consists of 3 keyloggers and a Remote Administration Tool trojan, with the reused code centered on keylogging activity. The other cluster consists of 4 blockchain implementations.
- Cluster #6: This cluster consists of 3 forks of the same repository, which contains miscellaneous hacking scripts by a hobbyist malware writer.
- Cluster #7-9: The remaining clusters are pairs of nodes. We see a pair of hacks for the video game Realm of the Mad God, a pair of hacks for the video game Counter-Strike: Global Offensive, and a pair of GitHub repositories that collect malware samples.
B.3 Python Clusters
We next describe the most prominent similarity matches seen in Python code reuse:
- Cluster #1: There are 1,615 UFSF repositories, 6 UF post threads and 12 UF code attachments in the main supercluster. 5 of these 12 are videogame hacks for the game PUBG. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.
- Cluster #2: These are 7 Remote Administration Tool GitHub repositories, 5 of them being versions of one tool and 2 of them versions of another. There is a big overlap between this cluster and Cluster #10 of C/C++ code. The code reuse includes writing and reading files and preparing to execute shellcode.
- Cluster #3: These are 4 versions of one video game hack for the game Minecraft. They reuse code that deals with socket connections, among other things.
- Cluster #4: These are 4 GitHub repositories containing bitcoin miners.
- Cluster #5: These are 3 GitHub repositories that contain spamming programs for the app Instagram. They reuse a lot of code that performs the critical functionality.
- Cluster #6: These are 2 GitHub repositories and 1 source code attachment from an Underground Forum. They reuse code that performs UDP flooding. This is particularly interesting as it is not obvious that they are related projects.
- Cluster #7: These are 3 GitHub repositories that contain tools for performing DOS attacks.
- Cluster #8: These are 3 video game hacks for the game Apex Legends. They reuse code that gathers information from the game screen and aids the player in aiming.
- Cluster #9: These are 3 GitHub repositories that contain ransomware and reuse code that encrypts files.
- Cluster #10: These are 3 GitHub repositories which contain botnets. They reuse code that executes shellcode.
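The observation that the main supercluster contains localized subgraphs with stronger connections among their own nodes than to the rest of the supercluster can be quantified in several ways. The minimal sketch below uses one simple measure, the fraction of edges touching a candidate subgraph that stay inside it. The measure, the function name, and the example graph are assumptions made for illustration and are not taken from the paper.

```python
def internal_edge_ratio(edges, subgraph_nodes):
    """Fraction of edges touching the subgraph that stay inside it.

    A value close to 1.0 means the subgraph is connected mostly to itself;
    a value close to 0.0 means it is connected mostly to the outside.
    """
    nodes = set(subgraph_nodes)
    internal = 0
    external = 0
    for a, b in edges:
        endpoints_inside = (a in nodes) + (b in nodes)
        if endpoints_inside == 2:
            internal += 1
        elif endpoints_inside == 1:
            external += 1
    total = internal + external
    return internal / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical supercluster: a tight triangle a-b-c attached to a chain c-d-e.
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
    print(internal_edge_ratio(edges, {"a", "b", "c"}))
    print(internal_edge_ratio(edges, {"d", "e"}))
```

In this toy graph the triangle keeps three of its four incident edges internal, while the chain keeps only one of two, so the triangle would be the more strongly localized subgraph under this measure.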
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tereszkowski-Kaminski, M., Dash, S.K., Suarez-Tangil, G. (2024). A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14984. Springer, Cham. https://doi.org/10.1007/978-3-031-70896-1_3
DOI: https://doi.org/10.1007/978-3-031-70896-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70895-4
Online ISBN: 978-3-031-70896-1
eBook Packages: Computer Science (R0)