A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums

Tereszkowski-Kaminski, Michal; Dash, Santanu Kumar; Suarez-Tangil, Guillermo

doi:10.1007/978-3-031-70896-1_3

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14984))

Included in the following conference series:

European Symposium on Research in Computer Security

Abstract

To date, most analysis of collaboration between malware authors has been performed on meta-data and compiled binaries, while ignoring artifacts present in the source code. We collect a vast amount of malicious source code from Underground Forum posts, Underground Forum code attachments, and GitHub repositories, and devise a methodology that filters out most auxiliary code so that the measurement focuses on malicious code. We leverage this to perform an in-depth measurement of the reuse of malicious code between these malware centers as well as StackOverflow. We find that our methodology has high precision in identifying malicious code (93.1%) and provides a contemporary snapshot of malware code reuse across the Web, offering insights into the ways in which this reuse takes place.

Notes

1. www.cambridgecybercrime.uk

Acknowledgements

This project was funded by TED2021-132900A-I00, from the Spanish Ministry of Science and Innovation, with funds from MCIN/AEI/10.13039/501100011033, and the European Union-NextGenerationEU/PRTR; and by PID2022-143304OB-I00 funded by MCIN/AEI/10.13039/501100011033 and the ERDF “A way of making Europe.” M. Tereszkowski-Kaminski’s work was supported by “Programa Investigo” grant 2022-C23.I01.P03.S0020-0000038, funded by the European Union NextGeneration-EU/PRTR and MITES/SEPE. G. Suarez-Tangil has been appointed as 2019 Ramon y Cajal fellow (RYC-2020-029401-I) funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future.

Author information

Authors and Affiliations

IMDEA Networks Institute, Leganés, Spain
Michal Tereszkowski-Kaminski & Guillermo Suarez-Tangil
University of Surrey, Guildford, UK
Santanu Kumar Dash

Corresponding author

Correspondence to Michal Tereszkowski-Kaminski.

Editor information

Editors and Affiliations

Institut Polytechnique de Paris, Palaiseau, France
Joaquin Garcia-Alfaro
Bydgoszcz University of Science and Technology, Bydgoszcz, Poland
Rafał Kozik
Bydgoszcz University of Science and Technology, Bydgoszcz, Poland
Michał Choraś
Norwegian University of Science and Technology - NTNU, Gjøvik, Norway
Sokratis Katsikas

Appendices

A Benign Datasets

The benign datasets are used to find code reuse between benign code and the corpus samples of the same language during the Benign Function Filtering step of our methodology (refer to Sect. 4). We have 3 benign code datasets, one for each language category in the measurement. There is no separate benign dataset for C, as the C and C++ measurements are done on the same samples.

50KC. This is a dataset of compilable Java projects from GitHub. 3,624 of these are included in our benign function filtering step.

Wild C++. This is a dataset of C++ function source code files gathered from GitHub repositories which contain C++ source code, queried for projects with at least 10 stars. 1,000,000 samples are included in our benign function filtering step. This is only a fraction of the samples available, because clone detection at this scale is limited by our computational resources.

Py150k. This is a dataset of 150k Python source files gathered from GitHub repositories. All 150,000 samples are included in our benign function filtering step.
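
As a rough illustration of what a benign function filtering step could look like, the following minimal Python sketch drops corpus functions that closely match a function in a benign dataset. It is not the clone detector used in the paper: the token-set Jaccard similarity, the 0.8 threshold, and the names tokenize, jaccard and filter_benign are all assumptions chosen for brevity.

```python
import re


def tokenize(code):
    """Split source text into a set of identifier, number, and symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_benign(corpus, benign, threshold=0.8):
    """Keep only corpus functions that do not closely match any benign function.

    corpus maps a function name to its source text; benign is a list of
    benign function sources. The threshold is an illustrative assumption.
    """
    benign_tokens = [tokenize(src) for src in benign]
    kept = {}
    for name, src in corpus.items():
        tokens = tokenize(src)
        if all(jaccard(tokens, b) < threshold for b in benign_tokens):
            kept[name] = src
    return kept


# Hypothetical toy data: one function is a verbatim copy of benign code.
corpus = {
    "load_settings": "def load_settings(path):\n    return open(path).read()",
    "custom_routine": "def custom_routine(values):\n    return [v * 2 for v in values]",
}
benign = ["def load_settings(path):\n    return open(path).read()"]

remaining = filter_benign(corpus, benign)
print(sorted(remaining))  # ['custom_routine']
```

In this toy run the verbatim copy of the benign function is removed and only the unmatched function remains. A pairwise comparison like this scales poorly, which is consistent with the need to cap the Wild C++ dataset at a fraction of its samples.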

B Prominent Measurement Clusters

B.1 C/C++ Clusters

We next describe the most prominent similarity matches in C/C++ code reuse:

Cluster #1: There are 1,781 UFSF repositories, 1,161 UF post threads, and 764 UF code attachments in the main supercluster. Within this supercluster, the nature and amount of reuse varies substantially. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.
Cluster #2: These are 12 hacks for a videogame called DayZ. All 12 of these projects are code attachments from Underground Forums. They reuse a specific piece of malicious code which is a thread callback function.
Cluster #3: These are 11 hacks for a videogame called Grand Theft Auto 5. 10 of these are code attachments from Underground Forums, with the one remaining being a GitHub repository. They reuse a DLL-loading snippet.
Cluster #4: These are 9 Linux malware projects, 8 of them with Rootkit in the name, and all 9 are GitHub repositories. The name of one of them suggests it was a homework assignment for a class. They reuse a piece of code that initializes the rootkit.
Cluster #5: These are 8 threads from an Underground Forum. Their code reuse centers around low-level memory manipulation with function names such as ModifyMemory() and WriteToMemory().
Cluster #6: These are 7 ransomware samples, all GitHub repositories, one of them being a collection repository for ransomware. The reused code consists of functions that encrypt and decrypt files, as well as functions that obtain directory listings of files on the host system.
Cluster #7: These are 6 code attachments from Underground Forums, all 6 being hacks for the videogame Counter Strike: Global Offensive.
Cluster #8: These are 6 GitHub repositories dedicated to “hacking Windows memory”.
Cluster #9: These are 6 GitHub repositories which contain rootkits.
Cluster #10: These are 5 code attachments from Underground Forums which are videogame hacks. It appears they are different versions of the same hack, but we are unable to ascertain which videogame they target.
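
Cluster listings of this kind can be derived from a reuse graph in which nodes are projects and edges are similarity matches. The sketch below is one plausible way to do so, assuming clusters correspond to connected components ranked by size; the paper's exact clustering procedure is not given in this appendix, and the function names, node labels, and source labels are hypothetical.

```python
from collections import Counter, defaultdict


def find_clusters(edges):
    """Group projects into connected components, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, clusters = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


def composition(cluster, source_of):
    """Count how many nodes of a cluster come from each source."""
    return Counter(source_of[node] for node in cluster)


# Hypothetical toy graph: p1..p4 are linked by reuse, p5 and p6 form a pair.
source_of = {
    "p1": "github", "p2": "uf_attachment", "p3": "uf_attachment",
    "p4": "uf_post", "p5": "github", "p6": "github",
}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p5", "p6")]

clusters = find_clusters(edges)
for rank, cluster in enumerate(clusters, start=1):
    print(rank, len(cluster), dict(composition(cluster, source_of)))
```

Each printed row gives a cluster's rank, its size, and its breakdown by source, which is the kind of summary reported per cluster above.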

B.2 Java Clusters

We next describe the most prominent similarity matches seen in Java code reuse:

Cluster #1: There are 6 nodes in the network that come from Underground Forum snippets, making up 3.1% of the network. All of them are within the largest cluster. They deal with socket connections and user login functionality. Underground Forums and SourceFinder repositories, on the other hand, make up 89.8% of the network. These variants vary greatly in the malicious functionality they reuse, with some reusing socket connectivity code and others network vulnerability scanning code. Yet another sample took code from StackOverflow that fakes the working of threads using Thread.sleep() calls.
Cluster #2: The second largest cluster consists of 19 cryptocurrency miner GitHub repositories. The reuse centers around various functionality, from blockchain protocol implementations to encryption implementations.
Cluster #3: The third largest cluster consists of 7 hacks for the video game Call of Duty: Modern Warfare 3, all existing in the network as source code attachments from Underground Forums.
Cluster #4-5: The fourth and fifth largest clusters contain 4 projects each. One consists of 3 keyloggers and a Remote Administration Tool trojan, with the reused code centered on keylogging activity. The other cluster consists of 4 blockchain implementations.
Cluster #6: This cluster consists of 3 forks of the same repository which contains miscellaneous hacking scripts by a hobbyist malware writer.
Cluster #7-9: The remaining clusters are pairs of nodes. We see a pair of hacks for the video game Realm of the Mad God, a pair of hacks for the video game Counter-Strike: Global Offensive, and a pair of GitHub repositories that collect malware samples.

B.3 Python Clusters

We next describe the most prominent similarity matches seen in Python code reuse:

Cluster #1: There are 1,615 UFSF repositories, 6 UF post threads and 12 UF code attachments in the main supercluster. 5 of these 12 are videogame hacks for the game PUBG. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.
Cluster #2: These are 7 Remote Administration Tool GitHub repositories, 5 of them being versions of one tool and 2 being versions of another. There is a big overlap between this cluster and Cluster #10 of C/C++ code. The code reuse includes writing and reading files and preparing to execute shellcode.
Cluster #3: These are 4 versions of one video game hack for the game Minecraft. They reuse code that deals with socket connections, among others.
Cluster #4: These are 4 GitHub repositories containing bitcoin miners.
Cluster #5: These are 3 GitHub repositories that contain spamming programs for the app Instagram. They reuse a lot of code that performs the critical functionality.
Cluster #6: These are 2 GitHub repositories and 1 source code attachment from an Underground Forum. They reuse code that performs UDP flooding. This is particularly interesting as it is not obvious that they are related projects.
Cluster #7: These are 3 GitHub repositories that contain tools for performing DoS attacks.
Cluster #8: These are 3 video game hacks for the game Apex Legends. They reuse code which gathers information from the game screen as well as aids the player in aiming.
Cluster #9: These are 3 GitHub repositories that contain ransomware and reuse code that encrypts files.
Cluster #10: These are 3 GitHub repositories which contain botnets. They reuse code that executes shellcode.
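
The superclusters are described as containing localized subgraphs whose nodes are more strongly connected to each other than to the rest of the supercluster. One simple way to check that property for a candidate group is to compare the number of edges inside the group with the number of edges crossing its boundary. The sketch below is only an illustration of that criterion; the appendix does not state how such subgraphs were identified, and edge_balance, is_localized, and the toy graph are hypothetical.

```python
def edge_balance(edges, group):
    """Return (internal, external) edge counts for a candidate group.

    An edge is internal when both endpoints are in the group and external
    when exactly one endpoint is in the group.
    """
    group = set(group)
    internal = sum(1 for a, b in edges if a in group and b in group)
    external = sum(1 for a, b in edges if (a in group) != (b in group))
    return internal, external


def is_localized(edges, group):
    """True when the group has more internal edges than boundary edges."""
    internal, external = edge_balance(edges, group)
    return internal > external


# Hypothetical toy supercluster: a, b, c form a triangle attached to a chain.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]

print(edge_balance(edges, {"a", "b", "c"}))  # (3, 1)
print(is_localized(edges, {"a", "b", "c"}))  # True
print(is_localized(edges, {"c", "d"}))       # False
```

A raw edge count ignores edge weights; if similarity scores are available, summing them instead of counting edges would be a natural variation.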

About this paper

Cite this paper

Tereszkowski-Kaminski, M., Dash, S.K., Suarez-Tangil, G. (2024). A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14984. Springer, Cham. https://doi.org/10.1007/978-3-031-70896-1_3

DOI: https://doi.org/10.1007/978-3-031-70896-1_3
Published: 06 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70895-4
Online ISBN: 978-3-031-70896-1
eBook Packages: Computer Science, Computer Science (R0)
