ABSTRACT
Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that complete, individually authored files can be attributed, these efforts have focused on ideal data sets such as the Google Code Jam data. We explore the problem of attribution "in the wild," examining source code obtained from open source version control systems, and investigate whether and how such contributions can be attributed to their authors, either individually or on a per-account basis. We show that open source contributors' accounts, which contain short, incomplete, and typically uncompilable fragments, can be effectively attributed.
Index Terms
- Git blame who?: stylistic authorship attribution of small, incomplete source code fragments