A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums

  • Conference paper
Computer Security – ESORICS 2024 (ESORICS 2024)

Abstract

To date, most analysis of collaboration between malware authors has been performed on meta-data and compiled binaries, while ignoring artifacts present in the source code. We collect a vast amount of malicious source code from Underground Forums posts, Underground Forum code attachments, and GitHub repositories and devise a methodology that allows us to filter out most auxiliary code, leaving the measurement to focus on malicious code. We leverage this to perform an in-depth measurement of the reuse of malicious code between these malware centers as well as StackOverflow. We find that our methodology has high precision in identifying malicious code (93.1%) and provides a contemporary snapshot of malware code reuse across the Web, offering insights into the manners in which this takes place.

Notes

  1. www.cambridgecybercrime.uk

Acknowledgements

This project was funded by TED2021-132900A-I00, from the Spanish Ministry of Science and Innovation, with funds from MCIN/AEI /10.13039/501100011033, and the European Union-NextGenerationEU/PRTR; and by PID2022-143304OB-I00 funded by MCIN/AEI /10.13039/501100011033/ and the ERDF “A way of making Europe.” M. Tereszkowski-Kaminski’s work was supported by “Programa Investigo” grant 2022-C23.I01.P03.S0020-0000038, funded by the European Union NextGeneration-EU/PRTR and MITES/SEPE. G. Suarez-Tangil has been appointed as 2019 Ramon y Cajal fellow (RYC-2020-029401-I) funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future.

Author information

Corresponding author

Correspondence to Michal Tereszkowski-Kaminski.

Appendices

A Benign Datasets

The benign datasets are used to find code reuse between benign code and the corpus samples of the same language during the Benign Function Filtering step of our methodology (refer to Sect. 4). We therefore have 3 benign code datasets, one for each language category in the measurement. There is no separate benign dataset for C, as the C and C++ measurements are performed on the same samples.

50KC. This is a dataset of compilable Java projects from GitHub. 3,624 of these are included in our benign function filtering step.

Wild C++. This is a dataset of C++ function source code files gathered from GitHub repositories that contain C++ source code, queried for projects with at least 10 stars. 1,000,000 samples are included in our benign function filtering step. This is only a fraction of the samples available; however, we are limited by the computational resources that clone detection requires.

Py150k. This is a dataset of 150k Python source files gathered from GitHub repositories. All 150,000 samples are included in our benign function filtering step.
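The role these datasets play can be illustrated with a minimal sketch. It assumes a simple token-set Jaccard similarity in place of whatever clone detector the methodology in Sect. 4 actually uses; the names `tokenize`, `jaccard`, `filter_benign` and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re


def tokenize(source: str) -> set[str]:
    """Reduce a function's source to a set of identifier and number tokens.

    This normalisation is an assumption for the sketch, not the paper's.
    """
    return set(re.findall(r"[A-Za-z_]\w*|\d+", source))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_benign(
    corpus: dict[str, str],
    benign: list[str],
    threshold: float = 0.8,  # hypothetical cut-off
) -> dict[str, str]:
    """Keep only corpus functions that are not near-copies of a benign function."""
    benign_tokens = [tokenize(src) for src in benign]
    kept = {}
    for name, src in corpus.items():
        tokens = tokenize(src)
        # Drop the function if it resembles any benign function too closely.
        if not any(jaccard(tokens, b) >= threshold for b in benign_tokens):
            kept[name] = src
    return kept


# Toy corpus: one utility function that also occurs in benign code,
# and one function that does not.
corpus = {
    "read_file": "def read_file(path):\n    return open(path).read()",
    "flood": "def flood(sock, target):\n    while True:\n        sock.sendto(payload, target)",
}
benign = ["def read_file(path):\n    return open(path).read()"]
remaining = filter_benign(corpus, benign)
```

The pairwise comparison in this sketch grows with the product of corpus size and benign dataset size, which hints at why the number of benign samples has to be bounded in practice.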

B Prominent Measurement Clusters

B.1 C/C++ Clusters

We next describe the most prominent similarity matches in C/C++ code reuse:

  • Cluster #1: There are 1,781 UFSF repositories, 1,161 UF post threads, and 764 UF code attachments in the main supercluster. Within this supercluster, the nature and amount of reuse varies substantially. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.

  • Cluster #2: These are 12 hacks for a videogame called DayZ. All 12 of these projects are code attachments from Underground Forums. They reuse a specific piece of malicious code which is a thread callback function.

  • Cluster #3: These are 11 hacks for a videogame called Grand Theft Auto 5. 10 of these are code attachments from Underground Forums, with the one remaining being a GitHub repository. They reuse a DLL-loading snippet.

  • Cluster #4: These are 9 Linux malware projects, 8 of them with Rootkit in the name, and all 9 are GitHub repositories. The name of one of them suggests it was a homework assignment for a class. They reuse a piece of code that initializes the rootkit.

  • Cluster #5: These are 8 threads from an Underground Forum. Their code reuse centers around low-level memory manipulation with function names such as ModifyMemory() and WriteToMemory().

  • Cluster #6: These are 7 ransomware samples, all GitHub repositories, one of them being a collection repository for ransomware. The reused code consists of functions that encrypt and decrypt files, as well as functions that obtain directory listings of files on the host system.

  • Cluster #7: These are 6 code attachments from Underground Forums, all 6 being hacks for the videogame Counter Strike: Global Offensive.

  • Cluster #8: These are 6 GitHub repositories dedicated to “hacking Windows memory”.

  • Cluster #9: These are 6 GitHub repositories which contain rootkits.

  • Cluster #10: These are 5 code attachments from Underground Forums which are videogame hacks. It appears they are different versions of the same hack, but we are unable to ascertain which videogame they target.
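As a rough illustration of how clusters like these can be read off a similarity network, the sketch below assumes that a cluster is a connected component of a graph whose edges join projects with matching code, and that each project carries a source label. The function name `clusters_by_size` and the labels are hypothetical; the paper's own cluster definition is not restated in this appendix.

```python
from collections import Counter, defaultdict


def clusters_by_size(edges, source_of):
    """Return connected components, largest first, with per-source counts.

    edges:     iterable of (project_a, project_b) similarity matches
    source_of: mapping from project name to its source label
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)

    clusters.sort(key=len, reverse=True)
    return [(sorted(c), Counter(source_of[n] for n in c)) for c in clusters]


# Toy network: three projects sharing code, plus one unrelated pair.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
source_of = {
    "a": "github",
    "b": "uf_attachment",
    "c": "uf_attachment",
    "d": "github",
    "e": "github",
}
result = clusters_by_size(edges, source_of)
```

Under this assumption, a supercluster is simply a very large component, and the per-source counts correspond to breakdowns such as the number of code attachments versus GitHub repositories reported for each cluster.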

B.2 Java Clusters

We next describe the most prominent similarity matches seen in Java code reuse:

  • Cluster #1: There are 6 nodes in the network that come from Underground Forum snippets, making up 3.1% of the network. All of them are within the largest cluster. They deal with socket connections and user login functionality. Underground Forums and SourceFinder repositories, on the other hand, make up 89.8% of the network. These variants vary greatly in the malicious functionality they reuse, with some reusing socket connectivity code and others network vulnerability scanning code. Yet another sample took code from StackOverflow that fakes the working of threads using Thread.sleep() calls.

  • Cluster #2: The second largest cluster consists of 19 cryptocurrency miner GitHub repositories. The reuse centers around various functionality, from blockchain protocol implementations to encryption implementations.

  • Cluster #3: The third largest cluster consists of 7 hacks for the video game Call of Duty: Modern Warfare 3, all existing in the network as source code attachments from Underground Forums.

  • Cluster #4-5: The fourth and fifth largest clusters both contain 4 projects each. One consists of 3 keyloggers and a Remote Administration Tool trojan with the code reused centered around keylogging activity. The other cluster consists of 4 blockchain implementations.

  • Cluster #6: This cluster consists of 3 forks of the same repository which contains miscellaneous hacking scripts by a hobbyist malware writer.

  • Cluster #7-9: The remaining clusters are pairs of nodes. We see a pair of hacks for the video game Realm of the Mad God, a pair of hacks for the video game Counter-Strike: Global Offensive, and a pair of GitHub repositories that collect malware samples.

B.3 Python Clusters

We next describe the most prominent similarity matches seen in Python code reuse:

  • Cluster #1: There are 1,615 UFSF repositories, 6 UF post threads and 12 UF code attachments in the main supercluster. 5 of these 12 are videogame hacks for the game PUBG. In places, there are localized subgraphs that have stronger connections to nodes within themselves than to the rest of the supercluster.

  • Cluster #2: These are 7 Remote Administration Tool GitHub repositories, 5 of them being versions of one tool and 2 of them versions of another. There is a big overlap between this cluster and Cluster #10 of C/C++ code. The code reuse includes writing and reading files and preparing to execute shellcode.

  • Cluster #3: These are 4 versions of one video game hack for the game Minecraft. They reuse code that deals with socket connections, among others.

  • Cluster #4: These are 4 GitHub repositories containing bitcoin miners.

  • Cluster #5: These are 3 GitHub repositories that contain spamming programs for the app Instagram. They reuse a lot of code that performs the critical functionality.

  • Cluster #6: These are 2 GitHub repositories and 1 source code attachment from an Underground Forum. They reuse code that performs UDP flooding. This is particularly interesting as it is not obvious that they are related projects.

  • Cluster #7: These are 3 GitHub repositories that contain tools for performing DoS attacks.

  • Cluster #8: These are 3 video game hacks for the game Apex Legends. They reuse code which gathers information from the game screen as well as aids the player in aiming.

  • Cluster #9: These are 3 GitHub repositories that contain ransomware and reuse code that encrypts files.

  • Cluster #10: These are 3 GitHub repositories which contain botnets. They reuse code that executes shellcode.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tereszkowski-Kaminski, M., Dash, S.K., Suarez-Tangil, G. (2024). A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14984. Springer, Cham. https://doi.org/10.1007/978-3-031-70896-1_3

  • DOI: https://doi.org/10.1007/978-3-031-70896-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70895-4

  • Online ISBN: 978-3-031-70896-1

  • eBook Packages: Computer Science (R0)
