cregit: Token-level blame information in git version control repositories

German, Daniel M.; Adams, Bram; Stewart, Kate

doi:10.1007/s10664-019-09704-x

Published: 08 May 2019

Volume 24, pages 2725–2763 (2019)

Empirical Software Engineering

Abstract

The blame feature of version control systems is widely used, both by practitioners and researchers, to determine who last modified a given line of code and the commit in which this contribution was made. The main disadvantage of blame is that, when a line is modified several times, it only shows the last commit that modified it, occluding previous changes to other areas of the same line. In this paper, we develop a method to increase the granularity of blame in git: instead of tracking lines of code, this method tracks tokens in source code. We evaluate its effectiveness with an empirical study in which we compare the accuracy of blame in git (per line) with our proposed blame-per-token method. We demonstrate that, in 5 large open source systems, blame-per-token properly identifies the commit that introduced a token with an accuracy between 94.5% and 99.2%, while blame-per-line only achieves an accuracy between 75% and 91% (with a margin of error of ±5% at a confidence level of 95%). We also classify the reasons why either blame method fails, highlighting each method’s weaknesses. The blame-per-token method has been implemented in an open source tool called cregit, which is currently in use by the Linux Foundation to identify the persons who have contributed to the source code of the Linux kernel.
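
To make the difference between the two granularities concrete, the following is a minimal sketch and not the cregit implementation. It assumes a toy history of a single line (a list of hypothetical commit ids paired with the line's text), a naive regular-expression tokenizer, and Python's `difflib.SequenceMatcher` as a stand-in for whatever diff the real tool relies on. The names `tokenize`, `blame_line` and `blame_tokens` are illustrative only.

```python
import difflib
import re


def tokenize(line):
    # Naive tokenizer (an assumption): runs of word characters,
    # or any single non-space, non-word character.
    return re.findall(r"\w+|[^\w\s]", line)


def blame_line(history):
    """Per-line blame: the last commit whose text differs from the previous version."""
    last_commit, previous_text = None, None
    for commit, text in history:
        if text != previous_text:
            last_commit = commit
        previous_text = text
    return last_commit


def blame_tokens(history):
    """Per-token blame: return (token, commit) pairs for the final version.

    Tokens that survive unchanged between two versions keep the commit
    that introduced them; inserted or replaced tokens are attributed
    to the commit of the newer version.
    """
    blamed = []  # (token, commit) pairs for the version processed so far
    for commit, text in history:
        new_tokens = tokenize(text)
        old_tokens = [token for token, _ in blamed]
        matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                updated.extend(blamed[i1:i2])  # keep the original attribution
            elif tag in ("replace", "insert"):
                updated.extend((token, commit) for token in new_tokens[j1:j2])
            # "delete": the old tokens vanish, nothing to carry over
        blamed = updated
    return blamed


# Hypothetical history of one line, oldest version first.
history = [
    ("c1", "int count = 0;"),
    ("c2", "int count = 0, total = 0;"),
    ("c3", "long count = 0, total = 0;"),
]

print(blame_line(history))  # the whole line is attributed to c3
for token, commit in blame_tokens(history):
    print(commit, token)
```

In this toy history, per-line blame attributes the entire line to `c3`, even though `c3` only changed `int` to `long`. The per-token view keeps `c1` for the original declaration and `c2` for the added `, total = 0`.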

Notes

https://rtyley.github.io/bfg-repo-cleaner/
cloc is an OSS tool to count lines of code and comments https://github.com/AlDanial/cloc
For this reason, other “diff” algorithms have been proposed, such as “patience diff” (originally implemented in the version control system Bazaar, and also implemented in git). Patience diff tries to maximize the number of unique unchanged lines by repeatedly running Myers’ diff on sections of the input. For a discussion of its benefits, see Schindelin (2009).
https://github.com/git/git/commit/c9018b0305a56436c85b292edbeacff04b0ebb5d
http://turingmachine.org/2018/cregit
http://github.com/cregit/evaluation
https://github.com/GumTreeDiff/gumtree
https://blogs.s-osg.org/made-change-using-cregit-debugging/
http://github.com/cregit
http://cregit.linuxsources.org

References

Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, pp 361–370
Asaduzzaman M, Roy CK, Schneider KA, Di Penta M (2013) Lhdiff: a language-independent hybrid approach for tracking source code lines. In: ICSM. IEEE Computer Society, pp 230–239
Asenov D, Guenat B, Müller P, Otth M (2017) Precise version control of trees with line-based version control systems. In: Huisman M, Rubin J (eds) Fundamental approaches to software engineering. Springer, Berlin, pp 152–169
Ayuso PN (2017) Frequently asked questions regarding gpl compliance and netfilter http://www.netfilter.org/licensing.html#faq
Bhattacharya P, Neamtiu I, Faloutsos M (2014) Determining developers’ expertise and role: a graph hierarchy-based approach. In: 2014 IEEE international conference on software maintenance and evolution, pp 11–20
Bille P (2005) A survey on tree edit distance and related problems. Theor Comput Sci 337(1-3):217–239
Bird C, Nagappan N, Murphy B, Gall H, Devanbu P (2011) Don’t touch my code!: examining the effects of ownership on software quality. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th european conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 4–14
Canfora G, Cerulo L, Di Penta M (2009) Tracking your changes: a language-independent approach. IEEE Softw 26(1):50–57
Chacon S, Straub B (2014) Pro git, 2nd edn. Apress, Berkeley
Chawathe SS, Rajaraman A, Garcia-Molina H, Widom J (1996) Change detection in hierarchically structured information. In: Proceedings of the 1996 ACM SIGMOD international conference on management of data, SIGMOD ’96. ACM, New York, pp 493–504
Cochran WG (1963) Sampling techniques, 2nd edn. Wiley, New York
Collard ML, Decker MJ, Maletic JI (2011) Lightweight transformation and fact extraction with the srcml toolkit. In: 2011 IEEE 11th international working conference on source code analysis and manipulation, pp 173–184
Collard ML, Decker MJ, Maletic JI (2013) srcml: an infrastructure for the exploration, analysis, and manipulation of source code: a tool demonstration. In: 2013 IEEE international conference on software maintenance, pp 516–519
Davies J, German DM, Godfrey MW, Hindle A (2011) Software bertillonage: finding the provenance of an entity. In: Proceedings of the 8th working conference on mining software repositories, MSR ’11. ACM, New York, pp 183–192
Dotzler G, Philippsen M (2016) Move-optimized source code tree differencing. In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering, ASE 2016. ACM, New York, pp 660–671
Falleri JR, Morandat F, Blanc X, Martinez M, Monperrus M (2014) Fine-grained and accurate source code differencing. In: ACM/IEEE international conference on automated software engineering, ASE’14, Vasteras, Sweden - September 15 - 19, 2014, pp 313–324
Feist MD, Santos EA, Watts I, Hindle A (2016) Visualizing project evolution through abstract syntax tree analysis. In: 2016 IEEE working conference on software visualization, VISSOFT 2016, Raleigh, NC, USA, October 3-4, 2016, pp 11–20
Fluri B, Wuersch M, Pinzger M, Gall H (2007) Change distilling: tree differencing for fine-grained source code change extraction. IEEE Trans Softw Eng 33(11):725–743
Fritz T, Murphy GC, Hill E (2007) Does a programmer’s activity indicate knowledge of code? In: Proceedings of the 6th joint meeting of the european software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC-FSE ’07. ACM, New York, pp 341–350
Fritz T, Ou J, Murphy GC, Murphy-Hill E (2010) A degree-of-knowledge model to capture source code familiarity. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering - volume 1, ICSE ’10. ACM, New York, pp 385–394
German DM (2006) A study of the contributors of postgresql. In: Proceedings of the 2006 international workshop on mining software repositories, MSR ’06, pp 163–164
German DM, Hassan AE, Robles G (2009) Change impact graphs: determining the impact of prior code changes. Inf Softw Technol 51(10):1394–1408
Girba T, Kuhn A, Seeberger M, Ducasse S (2005) How developers drive software evolution. In: Eighth international workshop on principles of software evolution (IWPSE’05), pp 113–122
Godfrey MW, Zou L (2005) Using origin analysis to detect merging and splitting of source code entities. IEEE Trans Softw Eng 31(2):166–181
Hashimoto M, Mori A (2008) Diff/ts: a tool for fine-grained structural change analysis. In: Proceedings of the 2008 15th working conference on reverse engineering, WCRE ’08. IEEE Computer Society, Washington, pp 279–288
Hassan AE (2009) Predicting faults using the complexity of code changes. In: 2009 IEEE 31st international conference on software engineering, pp 78–88
Hassan AE, Holt RC (2004) C-REX: an evolutionary code extractor for C - (PDF). Technical report, University of Waterloo. http://plg.uwaterloo.ca/~aeehassa/home/pubs/crex.pdf
Hata H, Mizuno O, Kikuno T (2012) Bug prediction based on fine-grained module histories. In: Proceedings of the 34th international conference on software engineering, ICSE ’12. IEEE Press, Piscataway, pp 200–210
Hattori LP, Lanza M, Robbes R (2012) Refining code ownership with synchronous changes. Empir Softw Eng 17(4-5):467–499
Higo Y, Ohtani A, Kusumoto S (2017) Generating simpler ast edit scripts by considering copy-and-paste. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering, ASE 2017. IEEE Press, Piscataway, pp 532–542
Ihara A, Kamei Y, Ohira M, Hassan AE, Ubayashi N, Matsumoto K (2014) Early identification of future committers in open source software projects. In: Proceedings of the 2014 14th international conference on quality software, QSIC ’14. IEEE Computer Society, Washington, pp 47–56
Kawrykow D, Robillard MP (2011) Non-essential changes in version histories. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 351–360
Khan S (2018) Who made that change and when: using cregit for debugging http://www.gonehiking.org/ShuahLinuxBlogs/blog/2018/10/18/who-made-that-change-and-when-using-cregit-for-debugging/
Kim M, Notkin D (2009) Discovering and representing systematic code changes. In: Proceedings of the 31st international conference on software engineering, ICSE ’09. IEEE Computer Society, Washington, pp 309–319
Kim S, Zimmermann T, Pan K, Whitehead Jr EJ (2006) Automatic identification of bug-introducing changes. In: Proceedings of the 21st IEEE/ACM international conference on automated software engineering, ASE ’06. IEEE Computer Society, Washington, pp 81–90
Ma D, Schuler D, Zimmermann T, Sillito J (2009) Expert recommendation with usage expertise. In: 2009 IEEE international conference on software maintenance, pp 535–538
Macho C, Mcintosh S, Pinzger M (2017) Extracting build changes with builddiff. In: Proceedings of the 14th international conference on mining software repositories, MSR ’17. IEEE Press, Piscataway, pp 368–378
McDonald DW, Ackerman MS (2000) Expertise recommender: a flexible recommendation system and architecture. In: Proceedings of the 2000 ACM conference on computer supported cooperative work, CSCW ’00. ACM, New York, pp 231–240
Meeker H (2017) Patrick mchardy and copyright profiteering. Open source https://opensource.com/article/17/8/patrick-mchardy-and-copyright-profiteering
Meng X, Miller BP, Williams WR, Bernat AR (2013) Mining software repositories for accurate authorship. In: Proceedings of the 2013 IEEE international conference on software maintenance, ICSM ’13. IEEE Computer Society, Washington, pp 250–259
Miller W, Myers EW (1985) A file comparison program. Softw Pract Exp 15(11):1025–1040
Minto S, Murphy GC (2007) Recommending emergent teams. In: Proceedings of the fourth international workshop on mining software repositories, MSR ’07. IEEE Computer Society, Washington, pp 5–
Miraldo VC, Dagand P-É, Swierstra W (2017) Type-directed diffing of structured data. In: Proceedings of the 2nd ACM SIGPLAN international workshop on type-driven development, TyDe 2017. ACM, New York, pp 2–15
Mockus A, Herbsleb JD (2002) Expertise browser: a quantitative approach to identifying expertise. In: Proceedings of the 24th international conference on software engineering, ICSE ’02. ACM, New York, pp 503–512
Myers EW (1986) An O(ND) difference algorithm and its variations. Algorithmica 1(1):251–266
Palix N, Falleri J-R, Lawall J (2015) Improving pattern tracking with a language-aware tree differencing algorithm. In: 22nd IEEE international conference on software analysis, evolution, and reengineering, SANER 2015 Montreal, QC, Canada, March 2-6, 2015, pp 43–52
Panciera K, Halfaker A, Terveen L (2009) Wikipedians are born, not made: a study of power editors on wikipedia. In: Proceedings of the ACM 2009 international conference on supporting group work, GROUP ’09. ACM, New York, pp 51–60
Raghavan S, Rohana R, Leon D, Podgurski A, Augustine V (2004) Dex: a semantic-graph differencing tool for studying changes in large code bases. In: Proceedings of the 20th IEEE international conference on software maintenance, ICSM ’04. IEEE Computer Society, Washington, pp 188–197
Rahman F, Devanbu P (2011) Ownership, experience and defects: a fine-grained study of authorship. In: Proceedings of the 33rd international conference on software engineering, ICSE ’11. ACM, New York, pp 491–500
Reiss SP (2008) Tracking source locations. In: Proceedings of the 30th international conference on software engineering, ICSE ’08. ACM, New York, pp 11–20
Schindelin J (2009) [patch 0/3] teach git about the patience diff algorithm. https://marc.info/?l=git&m=123082787502576&w=2
Schuler D, Zimmermann T (2008) Mining usage expertise from version archives. In: Proceedings of the 2008 international working conference on mining software repositories, MSR ’08. ACM, New York, pp 121–124
Servant F, Jones JA (2012) History slicing: assisting code-evolution tasks. In: Proceedings of the ACM SIGSOFT 20th international symposium on the foundations of software engineering, FSE ’12. ACM, New York, pp 43:1–43:11
Servant F, Jones JA (2017) Fuzzy fine-grained code-history analysis. In: Proceedings of the 39th international conference on software engineering, ICSE ’17. IEEE Press, Piscataway, pp 746–757
Sharwood S (2017) Linux kernel community tries to castrate GPL copyright troll. The register https://www.theregister.co.uk/2017/10/18/linux_kernel_community_enforcement_statement/
Shihab E, Mockus A, Kamei Y, Adams B, Hassan AE (2011) High-impact defects: a study of breakage and surprise defects. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on foundations of software engineering, ESEC/FSE ’11. ACM, New York, pp 300–310
Spacco J, Williams C (2009) Lightweight techniques for tracking unique program statements. In: 2009 Ninth IEEE international working conference on source code analysis and manipulation, pp 99–108
Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 1, pp 812–823
The Linux Foundation (2017) Linux foundation and free software foundation europe introduce resources to support open source software license identification and compliance https://www.linuxfoundation.org/press-release/2017/04/linux-foundation-and-free-software-foundation-europe-introduce-resources-to-support-open-source-software-license-identification-and-compliance/
Thongtanunam P, McIntosh S, Hassan AE, Iida H (2016) Revisiting code ownership and its relationship with software quality in the scope of modern code review. In: Proceedings of the 38th international conference on software engineering, ICSE ’16. ACM, New York, pp 1039–1050
Tsantalis N, Mansouri M, Eshkevari L, Mazinanian D, Dig D (2018) Accurate and efficient refactoring detection in commit history. In: Proceedings of the 40th international conference on software engineering, ICSE 2018
Tsikerdekis M (2018) Persistent code contribution: a ranking algorithm for code contribution in crowdsourced software. Empir Softw Eng 23(4):1871–1894
Ukkonen E (1985) Algorithms for approximate string matching. Inf Control 64(1):100–118. International Conference on Foundations of Computation Theory
Weissgerber P, Diehl S (2006) Identifying refactorings from source-code changes. In: 21st IEEE/ACM international conference on automated software engineering (ASE’06), pp 231–240
Welte H (2018) Report from the Geniatech vs. mchardy GPL violation court hearing http://laforge.gnumonks.org/blog/20180307-mchardy-gpl/
Xing Z, Stroulia E (2005) Umldiff: an algorithm for object-oriented design differencing. In: Proceedings of the 20th IEEE/ACM international conference on automated software engineering, ASE ’05. ACM, New York, pp 54–65
Ye Y, Kishida K (2003) Toward an understanding of the motivation of open source software developers. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, pp 419–429
Zhou M, Chen Q, Mockus A, Wu F (2017) On the scalability of linux kernel maintainers’ work. In: Proceedings of the 2017 11th joint meeting on foundations of software engineering, ESEC/FSE 2017. ACM, New York, pp 27–37

Author information

Authors and Affiliations

University of Victoria, Victoria, BC, V8P 5C2, Canada
Daniel M. German
Polytechnique Montréal, Montreal, QC, H3T 1J4, Canada
Bram Adams
Linux Foundation, San Francisco, CA, 94129, USA
Kate Stewart

Corresponding author

Correspondence to Daniel M. German.

Additional information

Communicated by: Romain Robbes

About this article

Cite this article

German, D.M., Adams, B. & Stewart, K. cregit: Token-level blame information in git version control repositories. Empir Software Eng 24, 2725–2763 (2019). https://doi.org/10.1007/s10664-019-09704-x

Issue Date: 15 August 2019