Continuously mining distributed version control systems: an empirical study of how Linux uses Git

Empirical Software Engineering

Abstract

Distributed version control systems (D-VCSs, such as Git and Mercurial) and their hosting services (such as GitHub and Bitbucket) have revolutionized the way in which developers collaborate by allowing them to freely exchange and integrate code changes in a peer-to-peer fashion. However, this flexibility comes at a price: code changes are hard to track because code repositories proliferate and because developers modify (“rebase”) and filter (“cherry-pick”) the history of these changes to streamline their integration into the repositories of other developers. As a consequence, researchers and practitioners, who typically consider only the (cleaned-up) history in the official project repository, are unaware of important elements and activities in the collaborative software development process. In this paper, we present a method that continuously mines all known D-VCSs of a software project to uncover the project's complete development history. We use this method (1) to show the divergence between the development history recorded in the official Linux kernel repository and the complete kernel development history, and (2) to investigate the characteristics of the ecosystem of Git repositories of the Linux kernel. Finally, we discuss how continuous mining could be adopted by current D-VCS hosting services.

Notes

  1. Even services built on top of D-VCSs, such as GitHub, provide no way to determine the set of all commits in a Super-repository, i.e., the commits that have already arrived in blessed and those that are still in other repositories.

  2. The metadata consists of the time when the commit was first committed (authorship date), the name of the author, the time when it was last committed (commit date), the committer, and the commit message.

  3. BitKeeper is the only D-VCS that optionally supports centralized logging.

  4. During 2012, there were 19 days on which Linus merged at least 1,000 commits.

  5. See “The Basic Rebase” in http://git-scm.com/book/ch3-6.html.

  6. Simple rebasing is usually performed automatically during a git pull operation with the option --rebase.

  7. http://www.kernel.org/doc/Documentation/development-process/2.Process

  8. git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next

  9. https://android.googlesource.com/kernel/msm

  10. Please contact the first author for information regarding access to this huge amount of data.
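
Notes 1 and 2 can be made concrete with a small sketch. This is a hypothetical illustration, not the authors' tooling: the names `CommitMetadata` and `super_repository`, the repository names other than blessed, and the short commit ids are all assumptions introduced here. The sketch models the metadata fields listed in note 2 and, given the commit ids found in each known repository, separates the commits that have already reached blessed from those that still exist only in other repositories.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CommitMetadata:
    """Metadata fields listed in note 2 (the class name is hypothetical)."""
    author: str
    author_date: datetime   # time when the commit was first committed
    committer: str
    commit_date: datetime   # time when the commit was last committed
    message: str


def super_repository(repos, blessed="blessed"):
    """Split all known commit ids into (in blessed, only elsewhere).

    `repos` maps a repository name to the set of commit ids it contains.
    """
    all_commits = set().union(*repos.values())
    in_blessed = set(repos.get(blessed, set()))
    return in_blessed, all_commits - in_blessed


# Made-up commit ids and repository names, for illustration only.
repos = {
    "blessed": {"a1", "b2"},
    "linux-next": {"a1", "b2", "c3"},
    "maintainer-tree": {"b2", "d4"},
}
merged, pending = super_repository(repos)
print(sorted(merged))   # ['a1', 'b2']
print(sorted(pending))  # ['c3', 'd4']
```

Identifying commits by id alone is a simplification: a rebased or cherry-picked commit receives a new id, so this sketch would count it as a separate commit.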

Author information

Corresponding author

Correspondence to Daniel M. German.

Additional information

Communicated by: Andreas Zeller

Cite this article

German, D.M., Adams, B. & Hassan, A.E. Continuously mining distributed version control systems: an empirical study of how Linux uses Git. Empir Software Eng 21, 260–299 (2016). https://doi.org/10.1007/s10664-014-9356-2

  • DOI: https://doi.org/10.1007/s10664-014-9356-2
