Git blame who?: stylistic authorship attribution of small, incomplete source code fragments

Authors:
Edwin Dauber, Drexel University
Aylin Caliskan, Princeton University
Richard Harang, Sophos Data Science Team
Rachel Greenstadt, Drexel University

ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, May 2018, Pages 356–357. https://doi.org/10.1145/3183440.3195007

Published: 27 May 2018

ABSTRACT

Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that complete files that are individually authored can be attributed, these efforts have focused on ideal data sets such as the Google Code Jam data. We explore the problem of attribution "in the wild," examining source code obtained from open source version control systems, and investigate if and how such contributions can be attributed to their authors, either individually or on a per-account basis. In this work we show that accounts belonging to open source contributors containing short, incomplete, and typically uncompilable fragments can be effectively attributed.

Index Terms

1. Security and privacy
  1. Security services
    1. Pseudonymity, anonymity and untraceability

Published in
ICSE '18: Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings
May 2018
231 pages
ISBN: 9781450356633
DOI: 10.1145/3183440
Conference Chair: Michel Chaudron (Chalmers University of Technology, University of Gothenburg, Sweden)
General Chair: Ivica Crnkovic (Chalmers University of Technology, University of Gothenburg, Sweden)
Program Chairs: Marsha Chechik (University of Toronto, Canada); Mark Harman (Facebook and University College London, United Kingdom)
Copyright © 2018 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
machine learning
source code authorship attribution
stylometry
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%
