cregit: Token-level blame information in git version control repositories

Empirical Software Engineering

Abstract

The blame feature of version control systems is widely used, by both practitioners and researchers, to determine who last modified a given line of code and the commit in which that change was made. The main disadvantage of blame is that, when a line is modified several times, it shows only the last commit that modified it, occluding previous changes to other areas of the same line. In this paper, we present a method that increases the granularity of blame in git: instead of tracking lines of code, it tracks tokens in source code. We evaluate its effectiveness with an empirical study in which we compare the accuracy of blame in git (per line) with our proposed blame-per-token method. We demonstrate that, in 5 large open source systems, blame-per-token properly identifies the commit that introduced a token with an accuracy between 94.5% and 99.2%, while blame-per-line achieves an accuracy of only between 75% and 91% (with a margin of error of ±5% at a confidence level of 95%). We also classify the reasons why either blame method fails, highlighting each method’s weaknesses. The blame-per-token method has been implemented in an open source tool called cregit, which is currently in use by the Linux Foundation to identify the persons who have contributed to the source code of the Linux kernel.
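
The difference between the two granularities can be illustrated with a small simulation. The sketch below is not cregit's implementation: it is a toy model, assuming a hypothetical three-commit history of a single line, a simple regular-expression tokenizer, and Python's difflib in place of git's diff machinery. The names `tokenize`, `blame_units`, and the commit ids `c1` to `c3` are invented for the illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(text):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def blame_units(history, split):
    """Attribute each unit of the newest version to the commit that introduced it.

    history: list of (commit_id, text) pairs, oldest first.
    split:   function that turns text into a list of units (lines or tokens).
    Returns a list of (unit, commit_id) pairs for the newest version.
    """
    blamed = []  # (unit, commit_id) pairs for the previous version
    for commit, text in history:
        units = split(text)
        old_units = [unit for unit, _ in blamed]
        matcher = SequenceMatcher(a=old_units, b=units, autojunk=False)
        new_blamed = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged units keep the commit that originally added them.
                new_blamed.extend(blamed[i1:i2])
            else:
                # Inserted or replaced units are attributed to this commit;
                # deleted units contribute nothing (j1 == j2).
                new_blamed.extend((unit, commit) for unit in units[j1:j2])
        blamed = new_blamed
    return blamed


# Hypothetical history: the same line is edited in three successive commits.
history = [
    ("c1", "int total = count(items);"),
    ("c2", "int total = count(items, 0);"),
    ("c3", "long total = count(items, 0);"),
]

line_blame = blame_units(history, str.splitlines)
token_blame = blame_units(history, tokenize)

print(line_blame)
print(token_blame)
```

In this toy history, the per-line view attributes the whole line to `c3`, while the per-token view still attributes `count` to `c1` and the added argument `0` to `c2`, which is the loss of information the abstract describes.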

Notes

  1. https://rtyley.github.io/bfg-repo-cleaner/

  2. cloc is an OSS tool to count lines of code and comments https://github.com/AlDanial/cloc

  3. For this reason, other “diff” algorithms have been proposed, such as “patience diff” (originally implemented in the version control system Bazaar, and also implemented in git). Patience diff tries to maximize the number of unique unchanged lines by repeatedly running Myers’ diff on sections of the input. For a discussion of its benefits, see Schindelin (2009).

  4. https://github.com/git/git/commit/c9018b0305a56436c85b292edbeacff04b0ebb5d

  5. http://turingmachine.org/2018/cregit

  6. http://github.com/cregit/evaluation

  7. https://github.com/GumTreeDiff/gumtree

  8. https://blogs.s-osg.org/made-change-using-cregit-debugging/

  9. http://github.com/cregit

  10. http://cregit.linuxsources.org

References

  • Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, pp 361–370

  • Asaduzzaman M, Roy CK, Schneider KA, Di Penta M (2013) Lhdiff: a language-independent hybrid approach for tracking source code lines. In: ICSM. IEEE Computer Society, pp 230–239

  • Asenov D, Guenat B, Müller P, Otth M (2017) Precise version control of trees with line-based version control systems. In: Huisman M, Rubin J (eds) Fundamental approaches to software engineering. Springer, Berlin, pp 152–169

  • Ayuso PN (2017) Frequently asked questions regarding gpl compliance and netfilter http://www.netfilter.org/licensing.html#faq

  • Bhattacharya P, Neamtiu I, Faloutsos M (2014) Determining developers’ expertise and role: a graph hierarchy-based approach. In: 2014 IEEE international conference on software maintenance and evolution, pp 11–20

  • Bille P (2005) A survey on tree edit distance and related problems. Theor Comput Sci 337(1-3):217–239

  • Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th european conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 4–14

  • Canfora G, Cerulo L, Di Penta M (2009) Tracking your changes: a language-independent approach. IEEE Soft 26(1):50–57

  • Chacon S, Straub B (2014) Pro git, 2nd edn. Apress, Berkeley

  • Chawathe SS, Rajaraman A, Garcia-Molina H, Widom J (1996) Change detection in hierarchically structured information. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, SIGMOD ’96. ACM, New York, pp 493–504

  • Cochran WG (1963) Sampling techniques, 2nd edn. Wiley, New York

  • Collard ML, Decker MJ, Maletic JI (2011) Lightweight transformation and fact extraction with the srcml toolkit. In: 2011 IEEE 11th international working conference on source code analysis and manipulation, pp 173–184

  • Collard ML, Decker MJ, Maletic JI (2013) srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. In: 2013 IEEE international conference on software maintenance, pp 516–519

  • Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, MSR ’11. ACM, New York, pp 183–192

  • Dotzler G, Philippsen M (2016) Move-optimized source code tree differencing. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, ASE 2016. ACM, New York, pp 660–671

  • Falleri JR, Morandat F, Blanc X, Martinez M, Monperrus M (2014) Fine-grained and accurate source code differencing. In: ACM/IEEE international conference on automated software engineering, ASE’14, Vasteras, Sweden - September 15 - 19, 2014, pp 313–324

  • Feist MD, Santos EA, Watts I, Hindle A (2016) Visualizing project evolution through abstract syntax tree analysis. In: 2016 IEEE working conference on software visualization, VISSOFT 2016, Raleigh, NC, USA, October 3-4, 2016, pp 11–20

  • Fluri B, Wuersch M, Pinzger M, Gall H (2007) Change distilling: tree differencing for fine-grained source code change extraction. IEEE Trans Softw Eng 33(11):725–743

  • Fritz T, Murphy GC, Hill E (2007) Does a programmer’s activity indicate knowledge of code? In: Proceedings of the 6th joint meeting of the european software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC-FSE ’07. ACM, New York, pp 341–350

  • Fritz T, Ou J, Murphy GC, Murphy-Hill E (2010) A degree-of-knowledge model to capture source code familiarity. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE ’10. ACM, New York, pp 385–394

  • German DM (2006) A study of the contributors of postgresql. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06, pp 163–164

  • German DM, Hassan AE, Robles G (2009) Change impact graphs: determining the impact of prior code changes. Inf Softw Technol 51(10):1394–1408

  • Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05), pp 113–122

  • Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181

  • Hashimoto M, Mori A (2008) Diff/ts: a tool for fine-grained structural change analysis. In: Proceedings of the 2008 15th working conference on reverse engineering, WCRE ’08. IEEE Computer Society, Washington, pp 279–288

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: 2009 IEEE 31st international conference on software engineering, pp 78–88

  • Hassan AE, Holt RC (2004) C-REX: an evolutionary code extractor for C - (PDF). Technical report, University of Waterloo. http://plg.uwaterloo.ca/~aeehassa/home/pubs/crex.pdf

  • Hata H, Mizuno O, Kikuno T (2012) Bug prediction based on fine-grained module histories. In: Proceedings of the 34th international conference on software engineering, ICSE ’12. IEEE Press, Piscataway, pp 200–210

  • Hattori LP, Lanza M, Robbes R (2012) Refining code ownership with synchronous changes. Empir Softw Eng 17(4-5):467–499

  • Higo Y, Ohtani A, Kusumoto S (2017) Generating simpler ast edit scripts by considering copy-and-paste. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ASE 2017. IEEE Press, Piscataway, pp 532–542

  • Ihara A, Kamei Y, Ohira M, Hassan AE, Ubayashi N, Matsumoto K (2014) Early identification of future committers in open source software projects. In: Proceedings of the 2014 14th international conference on quality software, QSIC ’14. IEEE Computer Society, Washington, pp 47–56

  • Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 351–360

  • Khan S (2018) Who made that change and when: using cregit for debugging http://www.gonehiking.org/ShuahLinuxBlogs/blog/2018/10/18/who-made-that-change-and-when-using-cregit-for-debugging/

  • Kim M, Notkin D (2009) Discovering and representing systematic code changes. In: Proceedings of the 31st international conference on software engineering, ICSE ’09. IEEE Computer Society, Washington, pp 309–319

  • Kim S, Zimmermann T, Pan K, Whitehead Jr EJ (2006) Automatic identification of bug-introducing changes. In: Proceedings of the 21st IEEE/ACM international conference on automated software engineering, ASE ’06. IEEE Computer Society, Washington, pp 81–90

  • Ma D, Schuler D, Zimmermann T, Sillito J (2009) Expert recommendation with usage expertise. In: 2009 IEEE international conference on software maintenance, pp 535–538

  • Macho C, Mcintosh S, Pinzger M (2017) Extracting build changes with builddiff. In: Proceedings of the 14th international conference on mining software repositories, MSR ’17. IEEE Press, Piscataway, pp 368–378

  • McDonald DW, Ackerman MS (2000) Expertise recommender: a flexible recommendation system and architecture. In: Proceedings of the 2000 ACM conference on computer supported cooperative work, CSCW ’00. ACM, New York, pp 231–240

  • Meeker H (2017) Patrick mchardy and copyright profiteering. Open source https://opensource.com/article/17/8/patrick-mchardy-and-copyright-profiteering

  • Meng X, Miller BP, Williams WR, Bernat AR (2013) Mining software repositories for accurate authorship. In: Proceedings of the 2013 IEEE international conference on software maintenance, ICSM ’13. IEEE Computer Society, Washington, pp 250–259

  • Miller W, Myers EW (1985) A file comparison program. Soft Practice Exp 15(11):1025–1040

  • Minto S, Murphy GC (2007) Recommending emergent teams. In: Proceedings of the fourth international workshop on mining software repositories, MSR ’07. IEEE Computer Society, Washington, pp 5–

  • Miraldo VC, Dagand P-É, Swierstra W (2017) Type-directed diffing of structured data. In: Proceedings of the 2nd ACM SIGPLAN international workshop on type-driven development, TyDe 2017. ACM, New York, pp 2–15

  • Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, ICSE ’02. ACM, New York, pp 503–512

  • Myers EW (1986) An O(ND) difference algorithm and its variations. Algorithmica 1(1):251–266

  • Palix N, Falleri J-R, Lawall J (2015) Improving pattern tracking with a language-aware tree differencing algorithm. In: 22nd IEEE international conference on software analysis, evolution, and reengineering, SANER 2015 Montreal, QC, Canada, March 2-6, 2015, pp 43–52

  • Panciera K, Halfaker A, Terveen L (2009) Wikipedians are born, not made: a study of power editors on wikipedia. In: Proceedings of the ACM 2009 international conference on supporting group work, GROUP ’09. ACM, New York, pp 51–60

  • Raghavan S, Rohana R, Leon D, Podgurski A, Augustine V (2004) Dex: a semantic-graph differencing tool for studying changes in large code bases. In: Proceedings of the 20th IEEE international conference on software maintenance, ICSM ’04. IEEE Computer Society, Washington, pp 188–197

  • Rahman F, Devanbu P (2011) Ownership, experience and defects: a fine-grained study of authorship. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 491–500

  • Reiss SP (2008) Tracking source locations. In: Proceedings of the 30th international conference on software engineering, ICSE ’08. ACM, New York, pp 11–20

  • Schindelin J (2009) [patch 0/3] teach git about the patience diff algorithm. https://marc.info/?l=git&m=123082787502576&w=2

  • Schuler D, Zimmermann T (2008) Mining usage expertise from version archives. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. ACM, New York, pp 121–124

  • Servant F, Jones JA (2012) History slicing: assisting code-evolution tasks. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12. ACM, New York, pp 43:1–43:11

  • Servant F, Jones JA (2017) Fuzzy fine-grained code-history analysis. In: Proceedings of the 39th international conference on software engineering, ICSE ’17. IEEE Press, Piscataway, pp 746–757

  • Sharwood S (2017) Linux kernel community tries to castrate GPL copyright troll. The register https://www.theregister.co.uk/2017/10/18/linux_kernel_community_enforcement_statement/

  • Shihab E, Mockus A, Kamei Y, Adams B, Hassan AE (2011) High-impact defects: a study of breakage and surprise defects. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 300–310

  • Spacco J, Williams C (2009) Lightweight techniques for tracking unique program statements. In: 2009 Ninth IEEE international working conference on source code analysis and manipulation, pp 99–108

  • Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 812–823

  • The Linux Foundation (2017) Linux foundation and free software foundation europe introduce resources to support open source software license identification and compliance https://www.linuxfoundation.org/press-release/2017/04/linux-foundation-and-free-software-foundation-europe-introduce-resources-to-support-open-source-software-license-identification-and-compliance/

  • Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th international conference on software engineering, ICSE ’16. ACM, New York, pp 1039–1050

  • Tsantalis N, Mansouri M, Eshkevari L, Mazinanian D, Dig D (2018) Accurate and efficient refactoring detection in commit history. In: Proceedings of the 40th international conference on software engineering, ICSE 2018

  • Tsikerdekis M (2018) Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software. Empir Softw Eng 23(4):1871–1894

  • Ukkonen E (1985) Algorithms for approximate string matching. Inf Control 64(1):100–118. International Conference on Foundations of Computation Theory

  • Weissgerber P, Diehl S (2006) Identifying refactorings from source-code changes. In: 21st IEEE/ACM international conference on automated software engineering (ASE’06), pp 231–240

  • Welte H (2018) Report from the Geniatech vs. mchardy GPL violation court hearing http://laforge.gnumonks.org/blog/20180307-mchardy-gpl/

  • Xing Z, Stroulia E (2005) Umldiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering, ASE ’05. ACM, New York, pp 54–65

  • Ye Y, Kishida K (2003) Toward an understanding of the motivation of open source software developers. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, pp 419–429

  • Zhou M, Chen Q, Mockus A, Wu F (2017) On the scalability of linux kernel maintainers’ work. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering, ESEC/FSE 2017. ACM, New York, pp 27–37

Author information

Corresponding author

Correspondence to Daniel M. German.

Additional information

Communicated by: Romain Robbes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

German, D.M., Adams, B. & Stewart, K. cregit: Token-level blame information in git version control repositories. Empir Software Eng 24, 2725–2763 (2019). https://doi.org/10.1007/s10664-019-09704-x

  • DOI: https://doi.org/10.1007/s10664-019-09704-x
