Abstract
Distributed version control systems (D-VCSs, such as git and mercurial) and their hosting services (such as GitHub and Bitbucket) have revolutionized the way in which developers collaborate by allowing them to freely exchange and integrate code changes in a peer-to-peer fashion. However, this flexibility comes at a price: code changes are hard to track because code repositories proliferate and because developers modify ("rebase") and filter ("cherry-pick") the history of these changes to streamline their integration into the repositories of other developers. As a consequence, researchers and practitioners, who typically consider only the (cleaned-up) history in the official project repository, are unaware of important elements and activities in the collaborative software development process. In this paper, we present a method that continuously mines all known D-VCS repositories of a software project to uncover the project's complete development history. We use this method to (1) show the divergence between the development history recorded in the official Linux kernel repository and the complete kernel development history, and (2) investigate the characteristics of the ecosystem of git repositories of the Linux kernel. Finally, we discuss how continuous mining could be adopted by current D-VCS hosting services.













Notes
Even services built on top of D-VCSs, such as GitHub, do not provide a way to determine the set of all commits in a Super-repository, i.e., the commits that have already arrived in blessed and those that are still in other repositories.
The metadata consists of the time when the commit was first committed (authorship date), the name of the author, the time when it was last committed (commit date), the committer, and the commit message.
BitKeeper is the only D-VCS that optionally supports centralized logging.
During 2012, there were 19 days where Linus merged at least 1,000 commits on the same day.
See "The Basic Rebase" in http://git-scm.com/book/ch3-6.html.
Simple rebasing is usually performed automatically during a git pull operation with the option --rebase.
Please contact the first author for information regarding access to this huge amount of data.
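The notes on the Super-repository and on commit metadata can be made concrete with a small sketch. This is an illustration under stated assumptions, not the tooling used in the study: the `CommitMetadata` class, the `split_super_repository` and `was_recommitted` helpers, and the representation of a repository as a dictionary from commit ID to metadata are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class CommitMetadata:
    """The five metadata fields listed in the notes (class name is hypothetical)."""
    author: str
    author_date: str   # time the commit was first committed (authorship date)
    committer: str
    commit_date: str   # time the commit was last committed (commit date)
    message: str


# Assumed representation: one repository maps commit ID -> metadata.
Repo = Dict[str, CommitMetadata]


def split_super_repository(
    repos: Dict[str, Repo], blessed: str = "blessed"
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (all commit IDs, IDs already in blessed, IDs still only elsewhere)."""
    all_commits: Set[str] = set()
    for commits in repos.values():
        all_commits.update(commits)          # union of commit IDs over every repository
    in_blessed = set(repos.get(blessed, {}))
    pending = all_commits - in_blessed       # seen somewhere, but not yet in blessed
    return all_commits, in_blessed, pending


def was_recommitted(meta: CommitMetadata) -> bool:
    """Heuristic: differing dates suggest the commit may have been rebased or cherry-picked."""
    return meta.commit_date != meta.author_date
```

Given snapshots of several repositories, `split_super_repository` yields the commits that have not yet reached blessed, which is the set that a single repository (or a hosting service view of it) does not expose. The date comparison in `was_recommitted` is only a hint, since the two dates can differ for other reasons.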
Additional information
Communicated by: Andreas Zeller
Cite this article
German, D.M., Adams, B. & Hassan, A.E. Continuously mining distributed version control systems: an empirical study of how Linux uses Git. Empir Software Eng 21, 260–299 (2016). https://doi.org/10.1007/s10664-014-9356-2
DOI: https://doi.org/10.1007/s10664-014-9356-2