Abstract
The blame feature of version control systems is widely used, both by practitioners and researchers, to determine who last modified a given line of code and the commit in which that change was made. The main disadvantage of blame is that, when a line is modified several times, it shows only the last commit that modified it, occluding earlier changes to other parts of the same line. In this paper, we present a method to increase the granularity of blame in git: instead of tracking lines of code, the method tracks tokens in source code. We evaluate its effectiveness with an empirical study that compares the accuracy of blame in git (per line) with our proposed blame-per-token method. We demonstrate that, in 5 large open source systems, blame-per-token properly identifies the commit that introduced a token with an accuracy between 94.5% and 99.2%, while blame-per-line achieves an accuracy of only 75% to 91% (with a margin of error of ±5% at a confidence level of 95%). We also classify the reasons why either blame method fails, highlighting each method’s weaknesses. The blame-per-token method has been implemented in an open source tool called cregit, which is currently in use by the Linux Foundation to identify the people who have contributed to the source code of the Linux kernel.
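To make the difference between the two granularities concrete, the following is a minimal sketch of blame computed per line and per token over a toy two-commit history. It is not cregit's implementation: the regex tokenizer, the `blame` helper, and the commit identifiers `c1` and `c2` are hypothetical, and Python's `difflib.SequenceMatcher` stands in for the diff algorithm that git would use.

```python
import difflib
import re

# Hypothetical tokenizer: identifiers, integer literals, or any single
# non-whitespace character. A real tool would use a language-aware lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(text):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def blame(versions, split):
    """Attribute each unit of the last version to the commit that introduced it.

    versions: list of (commit_id, text) pairs in chronological order.
    split:    function that turns text into a list of units (lines or tokens).
    """
    units, origins = [], []
    for commit_id, text in versions:
        new_units = split(text)
        # By default, every unit is attributed to the current commit.
        new_origins = [commit_id] * len(new_units)
        matcher = difflib.SequenceMatcher(a=units, b=new_units, autojunk=False)
        # Units that survive unchanged keep the commit they already had.
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                new_origins[block.b + k] = origins[block.a + k]
        units, origins = new_units, new_origins
    return list(zip(units, origins))


# Toy history: the second commit changes a single literal on the line.
history = [
    ("c1", "int total = count + 1;\n"),
    ("c2", "int total = count + 2;\n"),
]

line_blame = blame(history, str.splitlines)
token_blame = blame(history, tokenize)

print(line_blame)
print(token_blame)
```

In this toy history, the per-line result attributes the whole line to `c2`, whereas the per-token result attributes only the literal `2` to `c2` and every other token to `c1`, which is the occlusion effect described above.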
Notes
cloc is an open source tool that counts lines of code and comments: https://github.com/AlDanial/cloc
For this reason, other “diff” algorithms have been proposed, such as “patience diff” (originally implemented in the version control system Bazaar, and also implemented in git). Patience diff tries to maximize the number of unique unchanged lines by repeatedly running Myers’ diff on sections of the input. For a discussion of its benefits, we refer the reader to Schindelin (2009).
References
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, pp 361–370
Asaduzzaman M, Roy CK, Schneider KA, Di Penta M (2013) Lhdiff: a language-independent hybrid approach for tracking source code lines. In: ICSM. IEEE Computer Society, pp 230–239
Asenov D, Guenat B, Müller P, Otth M (2017) Precise version control of trees with line-based version control systems. In: Huisman M, Rubin J (eds) Fundamental approaches to software engineering. Springer, Berlin, pp 152–169
Ayuso PN (2017) Frequently asked questions regarding gpl compliance and netfilter http://www.netfilter.org/licensing.html#faq
Bhattacharya P, Neamtiu I, Faloutsos M (2014) Determining developers’ expertise and role: a graph hierarchy-based approach. In: 2014 IEEE international conference on software maintenance and evolution, pp 11–20
Bille P (2005) A survey on tree edit distance and related problems. Theor Comput Sci 337(1-3):217–239
Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th european conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 4–14
Canfora G, Cerulo L, Di Penta M (2009) Tracking your changes: a language-independent approach. IEEE Softw 26(1):50–57
Chacon S, Straub B (2014) Pro git, 2nd edn. Apress, Berkeley
Chawathe SS, Rajaraman A, Garcia-Molina H, Widom J (1996) Change detection in hierarchically structured information. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, SIGMOD ’96. ACM, New York, pp 493–504
Cochran WG (1963) Sampling techniques, 2nd edn. Wiley, New York
Collard ML, Decker MJ, Maletic JI (2011) Lightweight transformation and fact extraction with the srcml toolkit. In: 2011 IEEE 11th international working conference on source code analysis and manipulation, pp 173–184
Collard ML, Decker MJ, Maletic JI (2013) srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. In: 2013 IEEE international conference on software maintenance, pp 516–519
Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, MSR ’11. ACM, New York, pp 183–192
Dotzler G, Philippsen M (2016) Move-optimized source code tree differencing. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, ASE 2016. ACM, New York, pp 660–671
Falleri JR, Morandat F, Blanc X, Martinez M, Monperrus M (2014) Fine-grained and accurate source code differencing. In: ACM/IEEE international conference on automated software engineering, ASE’14, Vasteras, Sweden - September 15 - 19, 2014, pp 313–324
Feist MD, Santos EA, Watts I, Hindle A (2016) Visualizing project evolution through abstract syntax tree analysis. In: 2016 IEEE working conference on software visualization, VISSOFT 2016, Raleigh, NC, USA, October 3-4, 2016, pp 11–20
Fluri B, Wuersch M, Pinzger M, Gall H (2007) Change distilling: tree differencing for fine-grained source code change extraction. IEEE Trans Softw Eng 33(11):725–743
Fritz T, Murphy GC, Hill E (2007) Does a programmer’s activity indicate knowledge of code? In: Proceedings of the 6th joint meeting of the european software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC-FSE ’07. ACM, New York, pp 341–350
Fritz T, Ou J, Murphy GC, Murphy-Hill E (2010) A degree-of-knowledge model to capture source code familiarity. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE ’10. ACM, New York, pp 385–394
German DM (2006) A study of the contributors of postgresql. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06, pp 163–164
German DM, Hassan AE, Robles G (2009) Change impact graphs: determining the impact of prior code changes. Inf Softw Technol 51(10):1394–1408
Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05), pp 113–122
Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181
Hashimoto M, Mori A (2008) Diff/ts: a tool for fine-grained structural change analysis. In: Proceedings of the 2008 15th working conference on reverse engineering, WCRE ’08. IEEE Computer Society, Washington, pp 279–288
Hassan AE (2009) Predicting faults using the complexity of code changes. In: 2009 IEEE 31st international conference on software engineering, pp 78–88
Hassan AE, Holt RC (2004) C-REX: an evolutionary code extractor for C. Technical report, University of Waterloo. http://plg.uwaterloo.ca/~aeehassa/home/pubs/crex.pdf
Hata H, Mizuno O, Kikuno T (2012) Bug prediction based on fine-grained module histories. In: Proceedings of the 34th international conference on software engineering, ICSE ’12. IEEE Press, Piscataway, pp 200–210
Hattori LP, Lanza M, Robbes R (2012) Refining code ownership with synchronous changes. Empir Softw Eng 17(4-5):467–499
Higo Y, Ohtani A, Kusumoto S (2017) Generating simpler ast edit scripts by considering copy-and-paste. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ASE 2017. IEEE Press, Piscataway, pp 532–542
Ihara A, Kamei Y, Ohira M, Hassan AE, Ubayashi N, Matsumoto K (2014) Early identification of future committers in open source software projects. In: Proceedings of the 2014 14th international conference on quality software, QSIC ’14. IEEE Computer Society, Washington, pp 47–56
Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 351–360
Khan S (2018) Who made that change and when: using cregit for debugging http://www.gonehiking.org/ShuahLinuxBlogs/blog/2018/10/18/who-made-that-change-and-when-using-cregit-for-debugging/
Kim M, Notkin D (2009) Discovering and representing systematic code changes. In: Proceedings of the 31st international conference on software engineering, ICSE ’09. IEEE Computer Society, Washington, pp 309–319
Kim S, Zimmermann T, Pan K, Whitehead Jr EJ (2006) Automatic identification of bug-introducing changes. In: Proceedings of the 21st IEEE/ACM international conference on automated software engineering, ASE ’06. IEEE Computer Society, Washington, pp 81–90
Ma D, Schuler D, Zimmermann T, Sillito J (2009) Expert recommendation with usage expertise. In: 2009 IEEE international conference on software maintenance, pp 535–538
Macho C, Mcintosh S, Pinzger M (2017) Extracting build changes with builddiff. In: Proceedings of the 14th international conference on mining software repositories, MSR ’17. IEEE Press, Piscataway, pp 368–378
McDonald DW, Ackerman MS (2000) Expertise recommender: a flexible recommendation system and architecture. In: Proceedings of the 2000 ACM conference on computer supported cooperative work, CSCW ’00. ACM, New York, pp 231–240
Meeker H (2017) Patrick mchardy and copyright profiteering. Open source https://opensource.com/article/17/8/patrick-mchardy-and-copyright-profiteering
Meng X, Miller BP, Williams WR, Bernat AR (2013) Mining software repositories for accurate authorship. In: Proceedings of the 2013 IEEE international conference on software maintenance, ICSM ’13. IEEE Computer Society, Washington, pp 250–259
Miller W, Myers EW (1985) A file comparison program. Softw Pract Exp 15(11):1025–1040
Minto S, Murphy GC (2007) Recommending emergent teams. In: Proceedings of the fourth international workshop on mining software repositories, MSR ’07. IEEE Computer Society, Washington, pp 5–
Miraldo VC, Dagand P-É, Swierstra W (2017) Type-directed diffing of structured data. In: Proceedings of the 2nd ACM SIGPLAN international workshop on type-driven development, TyDe 2017. ACM, New York, pp 2–15
Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, ICSE ’02. ACM, New York, pp 503–512
Myers EW (1986) An O(ND) difference algorithm and its variations. Algorithmica 1(1):251–266
Palix N, Falleri J-R, Lawall J (2015) Improving pattern tracking with a language-aware tree differencing algorithm. In: 22nd IEEE international conference on software analysis, evolution, and reengineering, SANER 2015 Montreal, QC, Canada, March 2-6, 2015, pp 43–52
Panciera K, Halfaker A, Terveen L (2009) Wikipedians are born, not made: a study of power editors on wikipedia. In: Proceedings of the ACM 2009 international conference on supporting group work, GROUP ’09. ACM, New York, pp 51–60
Raghavan S, Rohana R, Leon D, Podgurski A, Augustine V (2004) Dex: a semantic-graph differencing tool for studying changes in large code bases. In: Proceedings of the 20th IEEE international conference on software maintenance, ICSM ’04. IEEE Computer Society, Washington, pp 188–197
Rahman F, Devanbu P (2011) Ownership, experience and defects: a fine-grained study of authorship. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 491–500
Reiss SP (2008) Tracking source locations. In: Proceedings of the 30th international conference on software engineering, ICSE ’08. ACM, New York, pp 11–20
Schindelin J (2009) [patch 0/3] teach git about the patience diff algorithm. https://marc.info/?l=git&m=123082787502576&w=2
Schuler D, Zimmermann T (2008) Mining usage expertise from version archives. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. ACM, New York, pp 121–124
Servant F, Jones JA (2012) History slicing: assisting code-evolution tasks. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12. ACM, New York, pp 43:1–43:11
Servant F, Jones JA (2017) Fuzzy fine-grained code-history analysis. In: Proceedings of the 39th international conference on software engineering, ICSE ’17. IEEE Press, Piscataway, pp 746–757
Sharwood S (2017) Linux kernel community tries to castrate GPL copyright troll. The register https://www.theregister.co.uk/2017/10/18/linux_kernel_community_enforcement_statement/
Shihab E, Mockus A, Kamei Y, Adams B, Hassan AE (2011) High-impact defects: a study of breakage and surprise defects. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 300–310
Spacco J, Williams C (2009) Lightweight techniques for tracking unique program statements. In: 2009 Ninth IEEE international working conference on source code analysis and manipulation, pp 99–108
Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 812–823
The Linux Foundation (2017) Linux foundation and free software foundation europe introduce resources to support open source software license identification and compliance https://www.linuxfoundation.org/press-release/2017/04/linux-foundation-and-free-software-foundation-europe-introduce-resources-to-support-open-source-software-license-identification-and-compliance/
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th international conference on software engineering, ICSE ’16. ACM, New York, pp 1039–1050
Tsantalis N, Mansouri M, Eshkevari L, Mazinanian D, Dig D (2018) Accurate and efficient refactoring detection in commit history. In: Proceedings of the 40th international conference on software engineering, ICSE 2018
Tsikerdekis M (2018) Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software. Empir Softw Eng 23(4):1871–1894
Ukkonen E (1985) Algorithms for approximate string matching. Inf Control 64(1):100–118. International Conference on Foundations of Computation Theory
Weissgerber P, Diehl S (2006) Identifying refactorings from source-code changes. In: 21st IEEE/ACM international conference on automated software engineering (ASE’06), pp 231–240
Welte H (2018) Report from the Geniatech vs. mchardy GPL violation court hearing http://laforge.gnumonks.org/blog/20180307-mchardy-gpl/
Xing Z, Stroulia E (2005) Umldiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering, ASE ’05. ACM, New York, pp 54–65
Ye Y, Kishida K (2003) Toward an understanding of the motivation open source software developers. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, pp 419–429
Zhou M, Chen Q, Mockus A, Wu F (2017) On the scalability of linux kernel maintainers’ work. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering, ESEC/FSE 2017. ACM, New York, pp 27–37
Additional information
Communicated by: Romain Robbes
About this article
Cite this article
German, D.M., Adams, B. & Stewart, K. cregit: Token-level blame information in git version control repositories. Empir Software Eng 24, 2725–2763 (2019). https://doi.org/10.1007/s10664-019-09704-x
DOI: https://doi.org/10.1007/s10664-019-09704-x